Question

Biostrings regex matching

Aditya ▴ 160

How to do Biostrings regex matching?

chr1 <- BSgenome.Mmusculus.UCSC.mm10::Mmusculus$chr1

Biostrings::countPattern('AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA', chr1)
    [1] 363

Biostrings::countPattern('A{44}', chr1, fixed = FALSE)
    Error in .Call2("new_XString_from_CHARACTER", classname, x, start(solved_SEW),  :
      key 123 (char '{') not in lookup table
    Error in normargPattern(pattern, subject) :
      could not turn 'pattern' into a DNAString instance

Biostrings • 2.0k views

updated 4.9 years ago by Hervé Pagès • written 4.9 years ago by Aditya

score 3 · Answer 1 · 2019-06-11

Hi Aditya,

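matchPattern() and family in Biostrings don't support regex syntax. Note that fixed = FALSE does not switch on regex matching: it only controls whether IUPAC ambiguity letters in the pattern and subject are interpreted as ambiguities, which is why 'A{44}' fails when Biostrings tries to turn it into a DNAString. For regex matching you would have to use grep():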

library(Biostrings)
subject <- DNAStringSet(c("TTATATT", "CCCAACCCAAACCCAAAAAAT"))
grep("A{3}", subject)
# [1] 2

or regexpr() or gregexpr(), depending on what you are after:

regexpr("A{3}", subject)
# [1] -1  9
# attr(,"match.length")
# [1] -1  3
# attr(,"index.type")
# [1] "chars"
# attr(,"useBytes")
# [1] TRUE

gregexpr("A{3}", subject)
# [[1]]
# [1] -1
# attr(,"match.length")
# [1] -1
# attr(,"index.type")
# [1] "chars"
# attr(,"useBytes")
# [1] TRUE
#
# [[2]]
# [1]  9 15 18
# attr(,"match.length")
# [1] 3 3 3
# attr(,"index.type")
# [1] "chars"
# attr(,"useBytes")
# [1] TRUE
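
Note that gregexpr() only reports non-overlapping matches, which is why it finds 3 hits in the second sequence. If overlapping start positions are needed from a regex, one possible workaround is a zero-width lookahead with perl = TRUE. This is only a base R sketch on a plain character string, not something Biostrings provides:

```r
# Plain character copy of the second sequence from the example above
x <- "CCCAACCCAAACCCAAAAAAT"

# A zero-width lookahead matches at every position where "AAA" starts,
# without consuming characters, so overlapping hits are reported too
hits <- gregexpr("(?=A{3})", x, perl = TRUE)[[1]]
starts <- as.integer(hits)
starts
# [1]  9 15 16 17 18
```

The reported match lengths are 0 in this case, so the width of each hit has to be supplied separately.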

However, grep() and family won't be as efficient as matchPattern() and family on a DNAStringSet or DNAString object. This was actually the original motivation for coming up with the matchPattern() family of string matching functions in Biostrings.

FWIW, note that this family supports some limited form of fuzzy matching via the use of IUPAC ambiguity letters in the pattern and/or subject. It also supports a small number of mismatches and indels via the max.mismatch, min.mismatch, and with.indels arguments. See ?matchPattern for the details.
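
To illustrate the ambiguity-letter matching mentioned above, here is a minimal sketch that uses the fixed argument with an "N" in the pattern. The pattern "AAN" and the variable names are made up for illustration:

```r
library(Biostrings)

subject <- DNAString("CCCAACCCAAACCCAAAAAAT")

# Default (fixed = TRUE): the "N" in the pattern is treated as a literal
# letter, so it can only match an "N" in the subject
n_literal <- countPattern("AAN", subject)

# fixed = FALSE: "N" is interpreted as an ambiguity code (any nucleotide)
n_ambiguous <- countPattern("AAN", subject, fixed = FALSE)

c(literal = n_literal, ambiguous = n_ambiguous)
```

This is what fixed = FALSE is for in the matchPattern() family: it changes how IUPAC letters are interpreted, not how the pattern string is parsed.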

Finally, note that the grep("A{n}", subject) use case can easily be handled without using a regex at all. For example:

matchPattern(strrep("A", 3), subject[[2]])
#   Views on a 21-letter DNAString subject
# subject: CCCAACCCAAACCCAAAAAAT
# views:
#     start end width
# [1]     9  11     3 [AAA]
# [2]    15  17     3 [AAA]
# [3]    16  18     3 [AAA]
# [4]    17  19     3 [AAA]
# [5]    18  20     3 [AAA]

Not only will this be much more efficient than using grep() and family on long DNA sequences but, as you can see, unlike gregexpr(), which only reports non-overlapping matches, it also returns all the matches, including overlapping ones. This was another original motivation for coming up with the matchPattern() family of string matching functions. And of course, you can still combine this with fuzzy matching if you need that. For example, allowing a small number of edits (mismatches, insertions, or deletions):

matchPattern(strrep("A", 3), subject[[1]], max.mismatch=1, with.indels=TRUE)
#   Views on a 7-letter DNAString subject
# subject: TTATATT
# views:
#     start end width
# [1]     3   5     3 [ATA]

matchPattern(strrep("A", 5), "TTAAAATT", max.mismatch=1, with.indels=TRUE)
#   Views on a 8-letter BString subject
# subject: TTAAAATT
# views:
#     start end width
# [1]     3   6     4 [AAAA]

matchPattern(strrep("A", 6), "TTAATAATT", max.mismatch=2, with.indels=TRUE)
#   Views on a 9-letter BString subject
# subject: TTAATAATT
# views:
#     start end width
# [1]     3   7     5 [AATAA]
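
Applied to the original question, the same idea might look like the sketch below. A short made-up DNAString stands in for chr1 so the example runs without the mm10 BSgenome package. Because overlapping windows are counted, a single run longer than n contributes several hits; merging the hits with reduce() is one way to get the maximal runs instead:

```r
library(Biostrings)

# Hypothetical stand-in for chr1: a run of 46 A's flanked by other letters
subject <- DNAString(paste0("CG", strrep("A", 46), "TC"))
n <- 44

# Counts every (possibly overlapping) window of n consecutive A's
n_windows <- countPattern(strrep("A", n), subject)

# Merge overlapping hits to get the maximal A-runs of length >= n
m <- matchPattern(strrep("A", n), subject)
runs <- reduce(IRanges(start(m), end(m)))

n_windows
runs
```

Here the single 46-letter run yields three overlapping 44-letter windows, which reduce() collapses into one range of width 46.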

Hope this helps,

H.