Question: What is the use of a masked BSgenome?
0
4 weeks ago by
Germany

BSgenome.Mmusculus.UCSC.mm10 and BSgenome.Mmusculus.UCSC.mm10.masked contain the same information, with unknown nucleotides coded as either 'N' or '#':

    bsgenome <- BSgenome.Mmusculus.UCSC.mm10::BSgenome.Mmusculus.UCSC.mm10

BSgenome::alphabetFrequency(bsgenome$chr1) BSgenome::alphabetFrequency(maskedbs$chr1)


So it comes as no surprise that matching BSgenome.Mmusculus.UCSC.mm10 gives identical results as matching BSgenome.Mmusculus.UCSC.mm10.masked:

   cas9seq <- "CTTATATTGTCTCCAGCAGAAGG"
Biostrings::countPattern(cas9seq, bsgenome$chr1) # 59 Biostrings::countPattern(cas9seq, maskedbs$chr1)    # 59


The matching performance also seems to be very similar:

benchmark <- microbenchmark::microbenchmark
benchmark(Biostrings::countPattern(cas9seq, bsgenome$chr1), times = 50L) benchmark(Biostrings::countPattern(cas9seq, maskedbs$chr1), times = 50L)
benchmark(Biostrings::vcountPDict(cas9seq, bsgenome, min.mismatch = 0, max.mismatch = 0), times = 1)
benchmark(Biostrings::vcountPDict(cas9seq, maskedbs, min.mismatch = 0, max.mismatch = 0), times = 1)


Which makes one wonder: what is actually the use of a masked BSgenome? :-D

modified 10 days ago by shepherl ♦♦ 1.5k • written 4 weeks ago by Aditya120
1
10 days ago by
shepherl ♦♦ 1.5k
United States
shepherl ♦♦ 1.5k wrote:

The difference between masked and the other genomes are described in the DESCRIPTION of the "masked" package. Examples:

Taken from the BSgenome.Hsapiens.UCSC.hg19.masked landing page

Or taken from the BSgenome.Mmusculus.UCSC.mm10.masked landing page

The sequences are the same as in BSgenome.Mmusculus.UCSC.mm10, except that each of them has the 2 following masks on top: (1) the mask of assembly gaps (AGAPS mask), and (2) the mask of intra-contig ambiguities (AMB mask).

1

It is not true in general that a "naked" BSgenome object and its corresponding "masked" BSgenome object contain the same information, with unknown nucleotides encoded as either N or #. That would be true if the latter only contained the mask of assembly gaps (AGAPS masks) and intra-contig ambiguities (AMB masks), which are indeed encoded as N in the "naked" version. (To be precise, in some rare occasions, the intra-contig ambiguities are encoded with other IUPAC ambiguity letters, e.g. BSgenome.Hsapiens.UCSC.hg18 has some R's and M's on chr3.)

However most "masked" BSgenome objects also contain the RepeatMasker (RM) and Tandem Repeats Finder (TRF) masks which are NOT encoded as N in the "naked" version.

One important bit here is that masks can be activated or deactivated at the chromosome level. For example to activate the RepeatMasker (RM) mask on Human chr1, do something like:

library(BSgenome.Hsapiens.UCSC.hg19.masked)
chr1 <- genome\$chr1


Note that only the mask of assembly gaps (AGAPS) and intra-contig ambiguities (AMB) are active by default. So by default, yes, matching gives identical results on the "naked" and "masked" BSgenome objects but not if you activate the other masks.

The BSgenome.Mmusculus.UCSC.mm10.masked case is special because this "masked" BSgenome only contains the AGAPS and AMB masks. Furthermore, the AMB masks are empty (in other words, the assembled sequences don't contain ambiguity letters inside the contigs), so could in theory be removed but that's another story. In this particular case, the "naked" and "masked" objects contain the same information. A minor advantage of using the latter is speed, even though, as you noticed, the speed gain is somewhat limited. Another minor advantage is the possibility to do fuzzy matching (by specifying fixed=FALSE) without getting zillions of meaningless matches to the AGAPS regions (i.e. inter-contig) that you would get if you were using the "naked" BSgenome. Although this could also be prevented by using fixed="subject" on the "naked" BSgenome but this means that you know about fixed="subject". This is starting to be pretty advanced stuff so I should probably stop here.

Hope this helps,

H.