Question

calculate frequency of non-overlapping matches for a motif in a XStringSet

0

Entering edit mode

Stefanie Tauber ▴ 40

@stefanie-tauber-3978

Last seen 2.1 years ago

Germany

Hi,

I would like to count the number of disjoint matches for a given pattern and character vector where matches are sought. So, basically what gregexpr is doing. However I would like to do this for a set of patterns as well as a set of characters (such as XStringSet).

Right now I just apply gregexpr multiple times. Is there a more clever way to do this? I know about vcountPDict. However, I would like to count number of disjoint occurrences rather than total number of occurrences.

Any comment on this?

Toy example

library(Biostrings)

data(yeastSEQCHR1)
yeast1 <- DNAString(yeastSEQCHR1)

x = Views(yeast1, start = sample(length(yeast1),20), width=20)

# returns number of disjoint matches for a given pattern
gregexpr("AAA",x)

# vectorized way to get the number of matches for each motif in the dictionary
vcountPDict(DNAStringSet(c("AAA","TTT")), x)

Best,

Stefanie

Biostrings patternmatch biostrings matchpattern • 1.4k views

ADD COMMENT • link updated 8.2 years ago by Hervé Pagès 16k • written 8.2 years ago by Stefanie Tauber ▴ 40

score 1 · Accepted Answer · 2016-02-01

Hi Stephanie,

If you were able to use vmatchPDict() (instead of vcountPDict()) then you could get the ranges of the matches (instead of the counts only), then merge the ranges that overlap, and then finally count them. Unfortunately vmatchPDict() is not available yet. However you might still be able to use a loop and get reasonable performance:

If you don't have too many patterns (i.e. < 10000), you can loop over them and call vmatchPattern() in the loop:

my_patterns <- DNAStringSet(c("AAA","TTT"))
m1 <- sapply(my_patterns, function(p) elementLengths(as(vmatchPattern(p, as(x, "DNAStringSet")), "NormalIRangesList")))
m1
#       [,1] [,2]
#  [1,]    0    1
#  [2,]    1    0
#  [3,]    1    1
#  [4,]    1    0
#  [5,]    0    0
#  [6,]    0    0
#  [7,]    0    2
#  [8,]    3    0
#  [9,]    1    0
# [10,]    0    1
# [11,]    0    0
# [12,]    0    0
# [13,]    0    0
# [14,]    1    0
# [15,]    0    1
# [16,]    0    0
# [17,]    0    0
# [18,]    0    0
# [19,]    1    0
# [20,]    0    1

The coercion to NormalIRangesList is what performs the merging of overlapping ranges.

If you have many patterns but a small number of subjects, it might be more efficient to loop over the subjects and call matchPDict() in the loop:

m2 <- sapply(as(x, "DNAStringSet"), function(s) elementLengths(as(matchPDict(PDict(my_patterns), s), "NormalIRangesList")))
m2
#      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
# [1,]    0    1    1    1    0    0    0    3    1     0     0     0     0     1
# [2,]    1    0    1    0    0    0    2    0    0     1     0     0     0     0
#      [,15] [,16] [,17] [,18] [,19] [,20]
# [1,]     0     0     0     0     1     0
# [2,]     1     0     0     0     0     1

Note that m2 is the transpose of m1:

identical(m2, t(m1)) # [1] TRUE

Hope this helps,

H.