Question: Biostrings regex matching
19 days ago by
Aditya70
Germany
Aditya70 wrote:

How can I do regex matching with Biostrings?

chr1 <- BSgenome.Mmusculus.UCSC.mm10::Mmusculus$chr1

Biostrings::countPattern('AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA', chr1)
    [1] 363

Biostrings::countPattern('A{44}', chr1, fixed = FALSE)
    Error in .Call2("new_XString_from_CHARACTER", classname, x, start(solved_SEW),  :
      key 123 (char '{') not in lookup table
    Error in normargPattern(pattern, subject) :
      could not turn 'pattern' into a DNAString instance
Answer: Biostrings regex matching
13 days ago by
Hervé Pagès ♦♦ 14k
United States
Hervé Pagès ♦♦ 14k wrote:

Hi Aditya,

matchPattern() and family in Biostrings don't support the regex syntax. You would have to use grep() for that:

library(Biostrings)
subject <- DNAStringSet(c("TTATATT", "CCCAACCCAAACCCAAAAAAT"))
grep("A{3}", subject)
# [1] 2

or regexpr() or gregexpr(), depending on what you are after:

regexpr("A{3}", subject)
# [1] -1  9
# attr(,"match.length")
# [1] -1  3
# attr(,"index.type")
# [1] "chars"
# attr(,"useBytes")
# [1] TRUE

gregexpr("A{3}", subject)
# [[1]]
# [1] -1
# attr(,"match.length")
# [1] -1
# attr(,"index.type")
# [1] "chars"
# attr(,"useBytes")
# [1] TRUE
#
# [[2]]
# [1]  9 15 18
# attr(,"match.length")
# [1] 3 3 3
# attr(,"index.type")
# [1] "chars"
# attr(,"useBytes")
# [1] TRUE
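
As a rough sketch of how the gregexpr() output above might be used in practice, the following counts the non-overlapping matches per sequence and extracts the matched substrings. It works on a plain character vector (what as.character(subject) would give), and the names seqs, hits and n_hits are only illustrative:

```r
# Plain character vector standing in for as.character(subject)
seqs <- c("TTATATT", "CCCAACCCAAACCCAAAAAAT")

# One integer vector of match starts per sequence; -1 means "no match"
hits <- gregexpr("A{3}", seqs)

# Number of non-overlapping matches in each sequence
n_hits <- vapply(hits, function(h) sum(h > 0), integer(1))
n_hits

# The matched substrings themselves, one character vector per sequence
regmatches(seqs, hits)
```

Note that the regex engine resumes scanning after the end of each match, so overlapping occurrences are not counted here.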

However, grep() and family won't be as efficient as matchPattern() and family on a DNAStringSet or DNAString object. This was actually the original motivation for coming up with the matchPattern() family of string matching functions in Biostrings.

FWIW note that this family supports some limited form of fuzzy matching via the use of IUPAC ambiguity letters in the pattern and/or subject (enabled with fixed=FALSE). It also supports a small number of mismatches and indels via the max.mismatch, min.mismatch, and with.indels arguments. See ?matchPattern for the details.
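
To illustrate the ambiguity-letter matching mentioned above, here is a small sketch. It also shows why fixed=FALSE in the original question did not turn on regex support: that argument only controls whether IUPAC codes such as N act as wildcards. The pattern "CNA" is just an example picked for this sketch:

```r
library(Biostrings)

subject <- DNAString("CCCAACCCAAACCCAAAAAAT")

# Default fixed=TRUE: "N" in the pattern can only match a literal N,
# and this subject contains none
countPattern("CNA", subject)

# fixed=FALSE: "N" is interpreted as the IUPAC code for "any base"
m <- matchPattern("CNA", subject, fixed = FALSE)
start(m)
```

The same fixed=FALSE switch applies to ambiguity letters in the subject, so it is worth checking ?matchPattern before using it on sequences that contain N's.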

Finally note that the grep("A{n}", subject) use case can easily be handled without using regex at all. For example:

matchPattern(strrep("A", 3), subject[[2]])
#   Views on a 21-letter DNAString subject
# subject: CCCAACCCAAACCCAAAAAAT
# views:
#     start end width
# [1]     9  11     3 [AAA]
# [2]    15  17     3 [AAA]
# [3]    16  18     3 [AAA]
# [4]    17  19     3 [AAA]
# [5]    18  20     3 [AAA]

Not only will this be much more efficient than using grep() and family on long DNA sequences but, as you can see, unlike gregexpr(), which only reports non-overlapping matches, it returns all the matches, including the overlapping ones. This was another original motivation for coming up with the matchPattern() family of string matching functions. And of course, you can still combine this with the use of fuzzy matching if you need that. For example, allowing up to 1 edit (a mismatch, or a 1-nucleotide insertion or deletion):

matchPattern(strrep("A", 3), subject[[1]], max.mismatch=1, with.indels=TRUE)
#   Views on a 7-letter DNAString subject
# subject: TTATATT
# views:
#     start end width
# [1]     3   5     3 [ATA]
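
Coming back to the overlapping matches that matchPattern() reports for strrep("A", 3): if what you want is the maximal runs of A's rather than every overlapping hit, one possible approach is to merge the hits. This sketch assumes the ranges() and reduce() functions from IRanges (attached along with Biostrings); it is not something the matchPattern() family does by itself:

```r
library(Biostrings)

subject <- DNAString("CCCAACCCAAACCCAAAAAAT")
m <- matchPattern(strrep("A", 3), subject)

# One view per overlapping hit
length(m)

# Merge overlapping hits into maximal runs of at least 3 A's
runs <- reduce(ranges(m))
start(runs)
width(runs)
```

Each merged range then corresponds to what a regex like A{3,} would match, while the search itself still goes through matchPattern().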

Hope this helps,

H.


Thank you Herve :-).

written 12 days ago by Aditya70