Biostrings matchPattern with lower case
@storybenjamin-11722
Last seen 15 months ago
Germany

Hi,

Is it possible to match specifically lower case nucleotides (e.g. agct)? Repeat-masked genomes are often soft-masked, which leaves the masked regions in lower case, and in certain cases those regions might be of interest compared with the non-masked regions.

Example:

>random
AGAGTAGTagtAGT

Can Biostrings account for this or is everything automatically converted to upper case under the hood for convenience?

biostrings • 1.2k views
@herve-pages-1542
Last seen 3 days ago
Seattle, WA, United States

DNAString and DNAStringSet objects in Biostrings don't keep track of the case.

Note that we provide "masked genomes" for some organisms (e.g. BSgenome.Hsapiens.UCSC.hg38.masked) where the chromosome sequences have various masks on them (e.g. the RepeatMasker mask, among others). You can use these if you need string matching tools like matchPattern() to ignore the masked regions.

Another approach is to use BString/BStringSet objects instead of DNAString/DNAStringSet objects. Unlike the latter, the former preserve the case. (The BStringSet container is the general-purpose string container in Biostrings, so it is analogous to an ordinary character vector in base R.) Note that some matchPattern() functionalities specific to DNAString/DNAStringSet objects won't work with BString/BStringSet objects (e.g. fixed=FALSE).
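To make the difference concrete, here is a minimal sketch using the example sequence from the question. It assumes the Biostrings package is installed; the variable names (seq_text, dna, bs and so on) are made up for illustration only.

```r
library(Biostrings)

# Example sequence from the question, with a soft-masked (lower case) "agt"
seq_text <- "AGAGTAGTagtAGT"

# DNAString does not keep track of the case: the lower case letters are
# stored as upper case, so the soft-masked region can no longer be told apart.
dna <- DNAString(seq_text)
as.character(dna)
dna_hits <- matchPattern("AGT", dna)
start(dna_hits)      # includes the hit inside the formerly lower case region

# BString preserves the case, so matching is case sensitive.
bs <- BString(seq_text)
lower_hits <- matchPattern("agt", bs)
start(lower_hits)    # only the soft-masked occurrence
upper_hits <- matchPattern("AGT", bs)
start(upper_hits)    # only the non-masked occurrences
```

With a BString subject the pattern is compared letter by letter, so IUPAC ambiguity codes in the pattern are not expanded; as noted above, fixed=FALSE is not available in this case.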

Hope this helps.

H.
