Hi Aditya,
matchPattern() and family in Biostrings don't support the regex syntax. You would have to use grep() for that:
library(Biostrings)
subject <- DNAStringSet(c("TTATATT", "CCCAACCCAAACCCAAAAAAT"))
grep("A{3}", subject)
# [1] 2
or regexpr() or gregexpr(), depending on what you are after:
regexpr("A{3}", subject)
# [1] -1 9
# attr(,"match.length")
# [1] -1 3
# attr(,"index.type")
# [1] "chars"
# attr(,"useBytes")
# [1] TRUE
gregexpr("A{3}", subject)
# [[1]]
# [1] -1
# attr(,"match.length")
# [1] -1
# attr(,"index.type")
# [1] "chars"
# attr(,"useBytes")
# [1] TRUE
#
# [[2]]
# [1] 9 15 18
# attr(,"match.length")
# [1] 3 3 3
# attr(,"index.type")
# [1] "chars"
# attr(,"useBytes")
# [1] TRUE
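If you want the positions in a form that is easier to work with, here is a minimal sketch in base R. It assumes plain character copies of the two sequences, and the names x, hits, starts and matched are only for illustration:

```r
# Plain character copies of the two sequences used above
x <- c("TTATATT", "CCCAACCCAAACCCAAAAAAT")
hits <- gregexpr("A{3}", x)

# Drop the -1 "no match" sentinel and keep only the real start positions
starts <- lapply(hits, function(h) as.integer(h[h > 0]))

# regmatches() extracts the matched substrings themselves
matched <- regmatches(x, hits)
```

The first element of starts and matched is empty because the first sequence has no run of 3 A's.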
However, grep() and family won't be as efficient as matchPattern() and family on a DNAStringSet or DNAString object. This was actually the original motivation for coming up with the matchPattern() family of string matching functions in Biostrings.
FWIW, note that this family supports a limited form of fuzzy matching via the use of IUPAC ambiguity letters in the pattern and/or subject. It also supports a small number of mismatches and indels via the max.mismatch, min.mismatch, and with.indels arguments. See ?matchPattern for the details.
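To illustrate the ambiguity-letter part, here is a small sketch. The pattern "TNT" is just an example I picked (N stands for any nucleotide), and it relies on the fixed=FALSE argument so that the ambiguity letter is interpreted as a wildcard rather than matched literally:

```r
library(Biostrings)

subject <- DNAStringSet(c("TTATATT", "CCCAACCCAAACCCAAAAAAT"))

# fixed=FALSE tells matchPattern() to treat IUPAC ambiguity letters
# as wildcards, so "TNT" matches T, then any nucleotide, then T
m <- matchPattern("TNT", subject[[1]], fixed=FALSE)

# Start positions of the matches in TTATATT
start(m)
```

With the default fixed=TRUE, the N in the pattern would only match a literal N in the subject.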
Finally, note that the grep("A{n}", subject) use case can easily be handled without using regex at all. For example:
matchPattern(strrep("A", 3), subject[[2]])
# Views on a 21-letter DNAString subject
# subject: CCCAACCCAAACCCAAAAAAT
# views:
# start end width
# [1] 9 11 3 [AAA]
# [2] 15 17 3 [AAA]
# [3] 16 18 3 [AAA]
# [4] 17 19 3 [AAA]
# [5] 18 20 3 [AAA]
Not only will this be much more efficient than using grep() and family on long DNA sequences but, as you can see, unlike with a regex, it also returns all the matches, including the overlapping ones. This was another original motivation for coming up with the matchPattern() family of string matching functions. And of course, you can still combine this with the use of fuzzy matching if you need that. For example, allowing 1 nucleotide insertion or deletion:
matchPattern(strrep("A", 3), subject[[1]], max.mismatch=1, with.indels=TRUE)
# Views on a 7-letter DNAString subject
# subject: TTATATT
# views:
# start end width
# [1] 3 5 3 [ATA]
matchPattern(strrep("A", 5), "TTAAAATT", max.mismatch=1, with.indels=TRUE)
# Views on a 8-letter BString subject
# subject: TTAAAATT
# views:
# start end width
# [1] 3 6 4 [AAAA]
matchPattern(strrep("A", 6), "TTAATAATT", max.mismatch=2, with.indels=TRUE)
# Views on a 9-letter BString subject
# subject: TTAATAATT
# views:
# start end width
# [1] 3 7 5 [AATAA]
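The examples above search one sequence at a time. To run the same exact search over every sequence in the DNAStringSet in one call, the vectorized members of the family can be used. This is only a sketch, and the names pattern, counts, hits and starts are illustrative:

```r
library(Biostrings)

subject <- DNAStringSet(c("TTATATT", "CCCAACCCAAACCCAAAAAAT"))
pattern <- strrep("A", 3)

# Number of (possibly overlapping) exact matches in each sequence
counts <- vcountPattern(pattern, subject)

# All matches for all sequences, returned as an MIndex object
hits <- vmatchPattern(pattern, subject)

# Start positions of the matches, one list element per sequence
starts <- startIndex(hits)
```

Here counts plays the role of grep("A{3}", subject): which(counts > 0) gives the indices of the sequences that contain the pattern.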
Hope this helps,
H.
Thank you Herve :-).