Question: calculate frequency of non-overlapping matches for a motif in a XStringSet
0
3.7 years ago by
Germany
Stefanie Tauber40 wrote:

Hi,

I would like to count the number of disjoint matches for a given pattern and character vector where matches are sought. So, basically what gregexpr is doing. However I would like to do this for a set of patterns as well as a set of characters (such as XStringSet).

Right now I just apply gregexpr multiple times. Is there a more clever way to do this? I know about vcountPDict. However, I would like to count number of disjoint occurrences rather than total number of occurrences.

Any comment on this?

Toy example

library(Biostrings)

data(yeastSEQCHR1)
yeast1 <- DNAString(yeastSEQCHR1)

x = Views(yeast1, start = sample(length(yeast1),20), width=20)

# returns number of disjoint matches for a given pattern
gregexpr("AAA",x)

# vectorized way to get the number of matches for each motif in the dictionary
vcountPDict(DNAStringSet(c("AAA","TTT")), x)

Best,

Stefanie

modified 3.7 years ago by Hervé Pagès ♦♦ 14k • written 3.7 years ago by Stefanie Tauber40
Answer: calculate frequency of non-overlapping matches for a motif in a XStringSet
1
3.7 years ago by
Hervé Pagès ♦♦ 14k
United States
Hervé Pagès ♦♦ 14k wrote:

Hi Stephanie,

If you were able to use vmatchPDict() (instead of vcountPDict()) then you could get the ranges of the matches (instead of the counts only), then merge the ranges that overlap, and then finally count them. Unfortunately vmatchPDict() is not available yet. However you might still be able to use a loop and get reasonable performance:

If you don't have too many patterns (i.e. < 10000), you can loop over them and call vmatchPattern() in the loop:

my_patterns <- DNAStringSet(c("AAA","TTT"))
m1 <- sapply(my_patterns, function(p) elementLengths(as(vmatchPattern(p, as(x, "DNAStringSet")), "NormalIRangesList")))
m1
#       [,1] [,2]
#  [1,]    0    1
#  [2,]    1    0
#  [3,]    1    1
#  [4,]    1    0
#  [5,]    0    0
#  [6,]    0    0
#  [7,]    0    2
#  [8,]    3    0
#  [9,]    1    0
# [10,]    0    1
# [11,]    0    0
# [12,]    0    0
# [13,]    0    0
# [14,]    1    0
# [15,]    0    1
# [16,]    0    0
# [17,]    0    0
# [18,]    0    0
# [19,]    1    0
# [20,]    0    1

The coercion to NormalIRangesList is what performs the merging of overlapping ranges.

If you have many patterns but a small number of subjects, it might be more efficient to loop over the subjects and call matchPDict() in the loop:

m2 <- sapply(as(x, "DNAStringSet"), function(s) elementLengths(as(matchPDict(PDict(my_patterns), s), "NormalIRangesList")))
m2
#      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
# [1,]    0    1    1    1    0    0    0    3    1     0     0     0     0     1
# [2,]    1    0    1    0    0    0    2    0    0     1     0     0     0     0
#      [,15] [,16] [,17] [,18] [,19] [,20]
# [1,]     0     0     0     0     1     0
# [2,]     1     0     0     0     0     1

Note that m2 is the transpose of m1:

identical(m2, t(m1)) # [1] TRUE

Hope this helps,

H.

Hi Herve,

thanks for the comment. This definitely helps. Since I am interested in non-overlapping hits, I would calculate - after the conversion to 'NormalIRangesList' the number of non-overlapping hits by means of the width of the matches and the width of the pattern.
In my case, I have many ( about 800,000 ) subjects and few patterns (~ 100). Is there any possibility to speed things up?
The conversion to 'NormalIRangesList' takes several minutes for each single pattern ...
Thanks a lot!

Hi Stefanie,

Coercion from MIndex to NormalIRangesList was indeed very inefficient. I fixed this in Biostrings 2.38.4. Now it's much faster. Biostrings 2.38.4 should become available via biocLite() in about 36 hours.

Let me know if you still run into problems with this.

Cheers,

H.