I'm trying to find all occurrences of a degenerate DNA sequence (one containing N's, R's, K's, etc.) in the human genome, using the matchPattern function provided in Biostrings. However, when I call matchPattern(pattern, subject, fixed=FALSE) to force the IUPAC extended letters to be interpreted as ambiguity codes, it returns a lot of matches that are all N's, because the beginning and end of each sequenced chromosome in the human genome contain thousands of N's. Is there any way to ignore those regions, skip matches that are all N's, or trim the chromosomes to remove all the N's? Thank you very much.
Masked genomes worked great. Thank you very much for pointing that out.
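For anyone landing here later, here is a minimal sketch of the masking idea on a toy sequence. It is not the exact code I ran on the genome: the subject and pattern strings are made up for illustration, and instead of loading a pre-masked genome package it masks the N runs by hand with maskMotif from Biostrings.

```r
library(Biostrings)

# Toy "chromosome" (made-up data): real sequence flanked by runs of N's,
# mimicking the N-padded ends of assembled human chromosomes.
subject <- DNAString("NNNNNNNNNNACGTAGGTCAACGTNNNNNNNNNN")

# Degenerate pattern (made-up): R = A or G, N = any base.
pattern <- DNAString("RGNT")

# Without masking, fixed=FALSE lets the N's in the subject match anything,
# so the N-only flanks produce spurious hits.
hits_all <- matchPattern(pattern, subject, fixed = FALSE)

# Mask every run of N's; the result is a MaskedDNAString.
masked_subject <- maskMotif(subject, "N")

# Same search on the masked subject: masked regions are skipped.
hits_masked <- matchPattern(pattern, masked_subject, fixed = FALSE)

length(hits_all)
length(hits_masked)
start(hits_masked)
```

With a pre-masked genome package the maskMotif step should not be needed, as long as the mask covering the N blocks is active on the chromosome you search.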