Question: Biostrings regex matching
Aditya120 (Germany) wrote, 4 months ago:

How can I do regex matching with Biostrings? countPattern() works with a literal pattern but fails when I pass a regex:

chr1 <- BSgenome.Mmusculus.UCSC.mm10::Mmusculus$chr1

Biostrings::countPattern('AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA', chr1)
    [1] 363

Biostrings::countPattern('A{44}', chr1, fixed = FALSE)
    Error in .Call2("new_XString_from_CHARACTER", classname, x, start(solved_SEW),  :
      key 123 (char '{') not in lookup table
    Error in normargPattern(pattern, subject) :
      could not turn 'pattern' into a DNAString instance
Answer: Biostrings regex matching
Hervé Pagès (United States) wrote, 4 months ago:

Hi Aditya,

matchPattern() and family in Biostrings don't support the regex syntax. You would have to use grep() for that:

library(Biostrings)
subject <- DNAStringSet(c("TTATATT", "CCCAACCCAAACCCAAAAAAT"))
grep("A{3}", subject)
# [1] 2

or regexpr() or gregexpr(), depending on what you are after:

regexpr("A{3}", subject)
# [1] -1  9
# attr(,"match.length")
# [1] -1  3
# attr(,"index.type")
# [1] "chars"
# attr(,"useBytes")
# [1] TRUE

gregexpr("A{3}", subject)
# [[1]]
# [1] -1
# attr(,"match.length")
# [1] -1
# attr(,"index.type")
# [1] "chars"
# attr(,"useBytes")
# [1] TRUE
#
# [[2]]
# [1]  9 15 18
# attr(,"match.length")
# [1] 3 3 3
# attr(,"index.type")
# [1] "chars"
# attr(,"useBytes")
# [1] TRUE
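If what you are after is each maximal run of at least n A's (rather than fixed-width hits), an open-ended quantifier is one way to get it with gregexpr(). The following is only a minimal sketch: it uses plain character vectors in place of the DNAStringSet so that it runs without Biostrings (as.character() would convert a DNAStringSet), and the names hits, run_starts, run_widths and runs are made up for illustration.

```r
# Plain character vectors stand in for the DNAStringSet (an assumption made
# so that this sketch runs in base R; use as.character() on a DNAStringSet).
subject <- c("TTATATT", "CCCAACCCAAACCCAAAAAAT")

# "A{3,}" is greedy, so every hit is a maximal run of 3 or more A's.
hits <- gregexpr("A{3,}", subject)

# Start positions and widths of the runs found in the 2nd sequence.
run_starts <- as.integer(hits[[2]])
run_widths <- as.integer(attr(hits[[2]], "match.length"))

# The matched substrings themselves, one list element per sequence
# (a sequence without any hit yields character(0)).
runs <- regmatches(subject, hits)
```

On long chromosomes this still pays the cost of converting the sequence to a character string, so it is best kept for cases where the regex is really needed.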

However, grep() and family won't be as efficient as matchPattern() and family on a DNAStringSet or DNAString object. This was actually the original motivation for coming up with the matchPattern() family of string matching functions in Biostrings.

FWIW note that this family supports a limited form of fuzzy matching via the use of IUPAC ambiguity letters in the pattern and/or subject. It also supports a small number of mismatches and indels via the max.mismatch, min.mismatch, and with.indels arguments. See ?matchPattern for the details.
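As a small illustration of the ambiguity-letter matching: the fixed argument is what controls it. With fixed = FALSE, an IUPAC code such as N in the pattern is treated as "any of the bases it stands for" instead of as a literal letter. It does not switch on regex syntax, which is why 'A{44}' with fixed = FALSE still fails in the question. The sketch below assumes Biostrings is installed; the pattern "CNA" and the name m_iupac are made up for the example.

```r
library(Biostrings)

subject <- DNAString("CCCAACCCAAACCCAAAAAAT")

# With fixed = FALSE the N in the pattern matches any base, so this finds
# every C that is followed, two positions later, by an A.
m_iupac <- matchPattern("CNA", subject, fixed = FALSE)

# Start positions of the hits.
start(m_iupac)
```

With the default fixed = TRUE, the same call would only match a literal N in the subject.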

Finally note that the grep("A{n}", subject) use case can easily be handled without using regex at all. For example:

matchPattern(strrep("A", 3), subject[[2]])
#   Views on a 21-letter DNAString subject
# subject: CCCAACCCAAACCCAAAAAAT
# views:
#     start end width
# [1]     9  11     3 [AAA]
# [2]    15  17     3 [AAA]
# [3]    16  18     3 [AAA]
# [4]    17  19     3 [AAA]
# [5]    18  20     3 [AAA]

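Not only will this be much more efficient than using grep() and family on long DNA sequences but, as you can see, it also returns all the matches, including the overlapping ones, whereas gregexpr() only reports non-overlapping matches (starts 9, 15, and 18 above). This was another original motivation for coming up with the matchPattern() family of string matching functions. And of course, you can still combine this with the use of fuzzy matching if you need that. With with.indels=TRUE, max.mismatch is interpreted as the maximum edit distance, so each unit covers a mismatch, an insertion, or a deletion. For example, allowing 1 such edit: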

matchPattern(strrep("A", 3), subject[[1]], max.mismatch=1, with.indels=TRUE)
#   Views on a 7-letter DNAString subject
# subject: TTATATT
# views:
#     start end width
# [1]     3   5     3 [ATA]

matchPattern(strrep("A", 5), "TTAAAATT", max.mismatch=1, with.indels=TRUE)
#   Views on a 8-letter BString subject
# subject: TTAAAATT
# views:
#     start end width
# [1]     3   6     4 [AAAA]

matchPattern(strrep("A", 6), "TTAATAATT", max.mismatch=2, with.indels=TRUE)
#   Views on a 9-letter BString subject
# subject: TTAATAATT
# views:
#     start end width
# [1]     3   7     5 [AATAA]
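If the overlapping hits returned by an exact matchPattern() search are not what you want, one possible way to collapse them into maximal runs is to merge the hit ranges with reduce() from IRanges. This is only a sketch, assuming Biostrings and IRanges are installed; the names m and runs are made up for the example.

```r
library(Biostrings)
library(IRanges)  # provides IRanges() and reduce()

subject <- DNAString("CCCAACCCAAACCCAAAAAAT")

# All hits of "AAA", overlapping ones included.
m <- matchPattern(strrep("A", 3), subject)

# Merge overlapping or adjacent hits so that each remaining range covers one
# maximal run of at least 3 A's.
runs <- reduce(IRanges(start(m), end(m)))

start(runs)
width(runs)
```

The matching itself stays in matchPattern(), so this keeps its efficiency on long sequences; only the (usually much smaller) set of hit ranges is post-processed.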

Hope this helps,

H.

Thank you Herve :-).

Reply written 4 months ago by Aditya120