Question

What is the use of a masked BSgenome?

0

Entering edit mode

Aditya ▴ 160

@aditya-7667

Last seen 22 months ago

Germany

BSgenome.Mmusculus.UCSC.mm10 and BSgenome.Mmusculus.UCSC.mm10.masked contain the same information, with unknown nucleotides coded as either 'N' or '#':

    bsgenome <- BSgenome.Mmusculus.UCSC.mm10::BSgenome.Mmusculus.UCSC.mm10
    maskedbs <- BSgenome.Mmusculus.UCSC.mm10.masked::BSgenome.Mmusculus.UCSC.mm10.masked

    BSgenome::alphabetFrequency(bsgenome$chr1)
    BSgenome::alphabetFrequency(maskedbs$chr1)

So it comes as no surprise that matching BSgenome.Mmusculus.UCSC.mm10 gives identical results as matching BSgenome.Mmusculus.UCSC.mm10.masked:

   cas9seq <- "CTTATATTGTCTCCAGCAGAAGG" 
   Biostrings::countPattern(cas9seq, bsgenome$chr1)    # 59
   Biostrings::countPattern(cas9seq, maskedbs$chr1)    # 59

The matching performance also seems to be very similar:

benchmark <- microbenchmark::microbenchmark
benchmark(Biostrings::countPattern(cas9seq, bsgenome$chr1), times = 50L)
benchmark(Biostrings::countPattern(cas9seq, maskedbs$chr1), times = 50L)
benchmark(Biostrings::vcountPDict(cas9seq, bsgenome, min.mismatch = 0, max.mismatch = 0), times = 1)
benchmark(Biostrings::vcountPDict(cas9seq, maskedbs, min.mismatch = 0, max.mismatch = 0), times = 1)

Which makes one wonder: what is actually the use of a masked BSgenome? :-D

BSgenome BSgenome.Mmusculus.UCSC.mm10::Mmusculus Biostrings • 1.2k views

ADD COMMENT • link updated 4.6 years ago by shepherl 3.8k • written 4.7 years ago by Aditya ▴ 160

score 1 · Answer 1 · 2019-09-06

1

Entering edit mode

shepherl 3.8k

@lshep

Last seen 32 minutes ago

United States

The difference between masked and the other genomes are described in the DESCRIPTION of the "masked" package. Examples:

Taken from the BSgenome.Hsapiens.UCSC.hg19.masked landing page

The sequences are the same as in BSgenome.Hsapiens.UCSC.hg38, except that each of them has the 4 following masks on top: (1) the mask of assembly gaps (AGAPS mask), (2) the mask of intra-contig ambiguities (AMB mask), (3) the mask of repeats from RepeatMasker (RM mask), and (4) the mask of repeats from Tandem Repeats Finder (TRF mask). Only the AGAPS and AMB masks are "active" by default.

Or taken from the BSgenome.Mmusculus.UCSC.mm10.masked landing page

The sequences are the same as in BSgenome.Mmusculus.UCSC.mm10, except that each of them has the 2 following masks on top: (1) the mask of assembly gaps (AGAPS mask), and (2) the mask of intra-contig ambiguities (AMB mask).

ADD COMMENT • link 4.6 years ago shepherl 3.8k

1

Entering edit mode

It is not true in general that a "naked" BSgenome object and its corresponding "masked" BSgenome object contain the same information, with unknown nucleotides encoded as either N or #. That would be true if the latter only contained the mask of assembly gaps (AGAPS masks) and intra-contig ambiguities (AMB masks), which are indeed encoded as N in the "naked" version. (To be precise, in some rare occasions, the intra-contig ambiguities are encoded with other IUPAC ambiguity letters, e.g. BSgenome.Hsapiens.UCSC.hg18 has some R's and M's on chr3.)

However most "masked" BSgenome objects also contain the RepeatMasker (RM) and Tandem Repeats Finder (TRF) masks which are NOT encoded as N in the "naked" version.

One important bit here is that masks can be activated or deactivated at the chromosome level. For example to activate the RepeatMasker (RM) mask on Human chr1, do something like:

library(BSgenome.Hsapiens.UCSC.hg19.masked)
genome <- BSgenome.Hsapiens.UCSC.hg19.masked
chr1 <- genome$chr1
masks(chr1)  # see all the masks
active(masks(chr1))["RM"] <- TRUE  # activate the RM mask

Note that only the mask of assembly gaps (AGAPS) and intra-contig ambiguities (AMB) are active by default. So by default, yes, matching gives identical results on the "naked" and "masked" BSgenome objects but not if you activate the other masks.

The BSgenome.Mmusculus.UCSC.mm10.masked case is special because this "masked" BSgenome only contains the AGAPS and AMB masks. Furthermore, the AMB masks are empty (in other words, the assembled sequences don't contain ambiguity letters inside the contigs), so could in theory be removed but that's another story. In this particular case, the "naked" and "masked" objects contain the same information. A minor advantage of using the latter is speed, even though, as you noticed, the speed gain is somewhat limited. Another minor advantage is the possibility to do fuzzy matching (by specifying fixed=FALSE) without getting zillions of meaningless matches to the AGAPS regions (i.e. inter-contig) that you would get if you were using the "naked" BSgenome. Although this could also be prevented by using fixed="subject" on the "naked" BSgenome but this means that you know about fixed="subject". This is starting to be pretty advanced stuff so I should probably stop here.

Hope this helps,

H.

ADD REPLY • link 4.6 years ago Hervé Pagès 16k

0

Entering edit mode

Thank you Herve. As with your other posts, very illuminating.

ADD REPLY • link 4.6 years ago Aditya ▴ 160