Question: What is the use of a masked BSgenome?
0
gravatar for Aditya
4 weeks ago by
Aditya120
Germany
Aditya120 wrote:

BSgenome.Mmusculus.UCSC.mm10 and BSgenome.Mmusculus.UCSC.mm10.masked contain the same information, with unknown nucleotides coded as either 'N' or '#':

    bsgenome <- BSgenome.Mmusculus.UCSC.mm10::BSgenome.Mmusculus.UCSC.mm10
    maskedbs <- BSgenome.Mmusculus.UCSC.mm10.masked::BSgenome.Mmusculus.UCSC.mm10.masked

    BSgenome::alphabetFrequency(bsgenome$chr1)
    BSgenome::alphabetFrequency(maskedbs$chr1)

So it comes as no surprise that matching BSgenome.Mmusculus.UCSC.mm10 gives identical results as matching BSgenome.Mmusculus.UCSC.mm10.masked:

   cas9seq <- "CTTATATTGTCTCCAGCAGAAGG" 
   Biostrings::countPattern(cas9seq, bsgenome$chr1)    # 59
   Biostrings::countPattern(cas9seq, maskedbs$chr1)    # 59

The matching performance also seems to be very similar:

benchmark <- microbenchmark::microbenchmark
benchmark(Biostrings::countPattern(cas9seq, bsgenome$chr1), times = 50L)
benchmark(Biostrings::countPattern(cas9seq, maskedbs$chr1), times = 50L)
benchmark(Biostrings::vcountPDict(cas9seq, bsgenome, min.mismatch = 0, max.mismatch = 0), times = 1)
benchmark(Biostrings::vcountPDict(cas9seq, maskedbs, min.mismatch = 0, max.mismatch = 0), times = 1)

Which makes one wonder: what is actually the use of a masked BSgenome? :-D

ADD COMMENTlink modified 10 days ago by shepherl ♦♦ 1.5k • written 4 weeks ago by Aditya120
Answer: What is the use of a masked BSgenome?
1
gravatar for shepherl
10 days ago by
shepherl ♦♦ 1.5k
United States
shepherl ♦♦ 1.5k wrote:

The difference between masked and the other genomes are described in the DESCRIPTION of the "masked" package. Examples:

Taken from the BSgenome.Hsapiens.UCSC.hg19.masked landing page

The sequences are the same as in BSgenome.Hsapiens.UCSC.hg38, except that each of them has the 4 following masks on top: (1) the mask of assembly gaps (AGAPS mask), (2) the mask of intra-contig ambiguities (AMB mask), (3) the mask of repeats from RepeatMasker (RM mask), and (4) the mask of repeats from Tandem Repeats Finder (TRF mask). Only the AGAPS and AMB masks are "active" by default.

Or taken from the BSgenome.Mmusculus.UCSC.mm10.masked landing page

The sequences are the same as in BSgenome.Mmusculus.UCSC.mm10, except that each of them has the 2 following masks on top: (1) the mask of assembly gaps (AGAPS mask), and (2) the mask of intra-contig ambiguities (AMB mask).

ADD COMMENTlink written 10 days ago by shepherl ♦♦ 1.5k
1

It is not true in general that a "naked" BSgenome object and its corresponding "masked" BSgenome object contain the same information, with unknown nucleotides encoded as either N or #. That would be true if the latter only contained the mask of assembly gaps (AGAPS masks) and intra-contig ambiguities (AMB masks), which are indeed encoded as N in the "naked" version. (To be precise, in some rare occasions, the intra-contig ambiguities are encoded with other IUPAC ambiguity letters, e.g. BSgenome.Hsapiens.UCSC.hg18 has some R's and M's on chr3.)

However most "masked" BSgenome objects also contain the RepeatMasker (RM) and Tandem Repeats Finder (TRF) masks which are NOT encoded as N in the "naked" version.

One important bit here is that masks can be activated or deactivated at the chromosome level. For example to activate the RepeatMasker (RM) mask on Human chr1, do something like:

library(BSgenome.Hsapiens.UCSC.hg19.masked)
genome <- BSgenome.Hsapiens.UCSC.hg19.masked
chr1 <- genome$chr1
masks(chr1)  # see all the masks
active(masks(chr1))["RM"] <- TRUE  # activate the RM mask

Note that only the mask of assembly gaps (AGAPS) and intra-contig ambiguities (AMB) are active by default. So by default, yes, matching gives identical results on the "naked" and "masked" BSgenome objects but not if you activate the other masks.

The BSgenome.Mmusculus.UCSC.mm10.masked case is special because this "masked" BSgenome only contains the AGAPS and AMB masks. Furthermore, the AMB masks are empty (in other words, the assembled sequences don't contain ambiguity letters inside the contigs), so could in theory be removed but that's another story. In this particular case, the "naked" and "masked" objects contain the same information. A minor advantage of using the latter is speed, even though, as you noticed, the speed gain is somewhat limited. Another minor advantage is the possibility to do fuzzy matching (by specifying fixed=FALSE) without getting zillions of meaningless matches to the AGAPS regions (i.e. inter-contig) that you would get if you were using the "naked" BSgenome. Although this could also be prevented by using fixed="subject" on the "naked" BSgenome but this means that you know about fixed="subject". This is starting to be pretty advanced stuff so I should probably stop here.

Hope this helps,

H.

ADD REPLYlink modified 5 days ago • written 5 days ago by Hervé Pagès ♦♦ 14k

Thank you Herve. As with your other posts, very illuminating.

ADD REPLYlink written 5 days ago by Aditya120
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 286 users visited in the last hour