Question

Ignoring N's on Ends of the Chromosome When Using matchPattern

msmithmailbox ▴ 20

I'm trying to find all instances of a degenerate DNA sequence (one containing N's, R's, K's, etc.) in the human genome, using the matchPattern function from Biostrings. However, when I call matchPattern(pattern, subject, fixed=FALSE) to force the IUPAC extended letters to be interpreted as ambiguities, it returns many matches that consist entirely of N's, because the beginning and end of each sequenced chromosome in the human genome contain thousands of N's. Is there any way to ignore those regions, discard matches that are all N's, or trim the chromosomes to remove the N's? Thank you very much.

bsgenome matchPattern • 1.6k views

updated 8.9 years ago by Hervé Pagès 16k • written 8.9 years ago by msmithmailbox ▴ 20

score 2 · Answer 1 · 2015-06-15

Hi,

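Use fixed="subject" if you want the IUPAC extended letters in the pattern to be treated as ambiguities, but not those in the subject. With this setting an N in the subject is a literal letter that can only be matched by an N in the pattern, so the runs of N's at the chromosome ends no longer match the non-N letters of your pattern.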
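
To make the difference concrete, here is a minimal sketch on a toy sequence. The subject string and the RGNT motif are made up for illustration and stand in for a real chromosome and your real pattern; only matchPattern() and its fixed argument come from the discussion above.

```r
library(Biostrings)

# Toy stand-in for a chromosome: runs of N's flanking a short known sequence.
# Both the subject and the motif are invented for illustration.
subject <- DNAString("NNNNNNCCAGCTCCNNNNNN")
pattern <- DNAString("RGNT")   # R = A or G, N = any base

# fixed = FALSE: ambiguity codes act as wildcards on both sides, so the N's
# in the subject match every pattern letter and the N runs are reported too.
hits_both <- matchPattern(pattern, subject, fixed = FALSE)

# fixed = "subject": only the pattern's ambiguity codes act as wildcards.
# An N in the subject is a literal letter that R, G or T cannot match.
hits_pattern_only <- matchPattern(pattern, subject, fixed = "subject")

print(hits_both)
print(hits_pattern_only)
```

Comparing the two printed results should show that the all-N windows disappear once the subject is treated as fixed, while the real occurrence inside the known sequence is kept.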

Alternatively, you could use a masked genome (e.g. BSgenome.Hsapiens.UCSC.hg19.masked). In a masked genome the chromosome sequences are the same as in a non-masked genome but masks have been added on top of them to mask different kinds of regions. For example in BSgenome.Hsapiens.UCSC.hg19.masked each sequence has 4 masks and the 1st one is the "AGAPS mask" which masks the assembly gaps (i.e. it masks the inter-contig regions made of N's only). See ?BSgenome.Hsapiens.UCSC.hg19.masked for more information. matchPattern() and most other string matching tools in Biostrings ignore the masked regions.
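
The masking approach can be sketched on a toy sequence as well. This is an illustration under assumptions, not the exact setup of the masked genome package: the subject and the RGNT motif are invented, and maskMotif() is used here only to build a mask over the N blocks that plays the role of the AGAPS mask. The guarded section at the end runs only if BSgenome.Hsapiens.UCSC.hg19.masked happens to be installed; chr22 is an arbitrary choice.

```r
library(Biostrings)

# Toy stand-in for a chromosome with N's on both ends (invented data).
subject <- DNAString("NNNNNNCCAGCTCCNNNNNN")
pattern <- DNAString("RGNT")   # hypothetical degenerate motif

# Put a mask on every block of N's; the result is a MaskedDNAString whose
# underlying sequence is unchanged.
masked_subject <- maskMotif(subject, "N")
print(masked_subject)

# Even with fixed = FALSE, matching skips the masked regions.
hits_masked <- matchPattern(pattern, masked_subject, fixed = FALSE)
print(hits_masked)

# Same idea on the real masked genome, only if the data package is installed.
if (requireNamespace("BSgenome.Hsapiens.UCSC.hg19.masked", quietly = TRUE)) {
  library(BSgenome.Hsapiens.UCSC.hg19.masked)
  chr22 <- BSgenome.Hsapiens.UCSC.hg19.masked$chr22
  print(masks(chr22))   # lists the masks and shows which ones are active
  print(length(matchPattern(pattern, chr22, fixed = FALSE)))
}
```

The reported match positions refer to the original, unmasked coordinates, so nothing needs to be shifted afterwards as it would be if the N's were trimmed off.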

Cheers,

H.