Ignoring N's on Ends of the Chromosome When Using matchPattern
@msmithmailbox-8170
United States

I'm trying to find all instances of a degenerate DNA sequence (one containing N's, R's, K's, etc.) in the human genome, using the matchPattern function provided in Biostrings. However, when I call matchPattern(pattern, subject, fixed=FALSE) to force the IUPAC extended letters to be interpreted as ambiguities, it returns many matches that consist entirely of N's, because the beginning and end of each sequenced chromosome in the human genome contain thousands of N's. Is there any way to ignore those regions, ignore matches that are all N's, or trim the chromosomes to remove all the N's? Thank you very much.

bsgenome matchPattern • 1.8k views
@herve-pages-1542
Seattle, WA, United States

Hi,

Use fixed="subject" if you want to treat the IUPAC extended letters present in the pattern as ambiguities, but not those present in the subject.
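As a rough illustration of the difference, here is a minimal sketch on a toy sequence. The subject and the pattern "ARGCNT" are made up for the example; the toy subject simply has N-only runs at both ends, like the chromosome ends described in the question.

```r
library(Biostrings)

# Toy "chromosome" (hypothetical): N-only runs at both ends, real bases in between.
subject <- DNAString("NNNNNNNNACGTAAGCTTAGCANNNNNNNN")

# Hypothetical degenerate pattern: R = A or G, N = any base.
pattern <- DNAString("ARGCNT")

# fixed=FALSE: ambiguity codes act as wildcards in both pattern and subject,
# so windows made of N's in the subject are reported as matches.
hits_both <- matchPattern(pattern, subject, fixed = FALSE)

# fixed="subject": only the pattern's ambiguity codes act as wildcards;
# an N in the subject is treated as a literal N.
hits_subject <- matchPattern(pattern, subject, fixed = "subject")

length(hits_both)     # includes the spurious hits in the N runs
length(hits_subject)  # only hits on real sequence
hits_subject
```

Note that with fixed="subject" an N in the pattern can still line up with an N in the subject, so this helps most when the pattern contains at least one letter other than N.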

Alternatively, you could use a masked genome (e.g. BSgenome.Hsapiens.UCSC.hg19.masked). In a masked genome the chromosome sequences are the same as in a non-masked genome, but masks have been added on top of them to mask different kinds of regions. For example, in BSgenome.Hsapiens.UCSC.hg19.masked each sequence has 4 masks, the first of which is the "AGAPS mask", which masks the assembly gaps (i.e. the inter-contig regions made of N's only). See ?BSgenome.Hsapiens.UCSC.hg19.masked for more information. matchPattern() and most other string matching tools in Biostrings ignore the masked regions.
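To see the effect of a mask without loading a whole genome, here is a small sketch that puts a hand-built mask on a toy sequence. The sequence, the pattern, and the mask coordinates are all invented for the example; the mask only stands in for what the AGAPS mask does on a real chromosome.

```r
library(Biostrings)
library(IRanges)

# Toy "chromosome" (hypothetical): N-only runs at positions 1-8 and 23-30.
subject <- DNAString("NNNNNNNNACGTAAGCTTAGCANNNNNNNN")

# Hand-built mask covering the two N runs, playing the role of an AGAPS mask.
gap_mask <- Mask(mask.width = length(subject),
                 start = c(1, 23), end = c(8, 30))
masked_subject <- subject
masks(masked_subject) <- gap_mask   # now a MaskedDNAString

# Hypothetical degenerate pattern: R = A or G, N = any base.
pattern <- DNAString("ARGCNT")

# Even with fixed=FALSE, the masked regions are skipped.
hits_masked <- matchPattern(pattern, masked_subject, fixed = FALSE)
hits_masked
```

With the real package installed, the analogous call would be along the lines of matchPattern(pattern, BSgenome.Hsapiens.UCSC.hg19.masked$chr1, fixed=FALSE), and masks() on the chromosome shows which masks are present and active.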

Cheers,

H.


Masked genomes worked great. Thank you very much for pointing that out.
