Hi,
I would like to count the number of disjoint (non-overlapping) matches of a given pattern in each element of a character vector, which is basically what gregexpr does. However, I would like to do this for a set of patterns as well as a set of sequences (such as an XStringSet).
Right now I just apply gregexpr multiple times. Is there a more clever way to do this? I know about vcountPDict. However, I would like to count the number of disjoint occurrences rather than the total number of occurrences.
Any comment on this?
Toy example:

    library(Biostrings)
    data(yeastSEQCHR1)
    yeast1 <- DNAString(yeastSEQCHR1)
    # 20 random windows of width 20, kept within the bounds of the sequence
    x <- Views(yeast1, start = sample(length(yeast1) - 19, 20), width = 20)
    # returns the disjoint matches for a given pattern
    gregexpr("AAA", as.character(x))
    # vectorized way to get the number of matches for each motif in the dictionary
    vcountPDict(DNAStringSet(c("AAA", "TTT")), x)
Best,
Stefanie
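For reference, the "apply gregexpr multiple times" approach described above could be wrapped roughly as follows. This is only a sketch in base R: the function name count_disjoint is made up for illustration, and it assumes the patterns are literal strings (fixed = TRUE) and that the subjects can be converted with as.character().

```r
# Hypothetical helper (not part of Biostrings): counts disjoint matches of
# each pattern in each subject by calling gregexpr() once per pattern.
count_disjoint <- function(patterns, subjects) {
  subjects <- as.character(subjects)
  counts <- vapply(patterns, function(p) {
    # gregexpr() reports non-overlapping matches; -1 means "no match"
    hits <- gregexpr(p, subjects, fixed = TRUE)
    vapply(hits, function(h) sum(h > 0L), integer(1))
  }, integer(length(subjects)))
  # one row per subject, one column per pattern
  matrix(counts, nrow = length(subjects), dimnames = list(NULL, patterns))
}

res <- count_disjoint(c("AAA", "TTT"), c("AAAAAA", "TTTAAAA", "ACGT"))
print(res)
```

Each row of the result corresponds to a subject and each column to a pattern, which mirrors the layout of a pattern-by-subject count table, only transposed.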
Hi Herve,
thanks for the comment. This definitely helps. Since I am interested in non-overlapping hits, I would, after the conversion to 'NormalIRangesList', calculate the number of non-overlapping hits from the widths of the matches and the width of the pattern.
In my case, I have many (about 800,000) subjects and few patterns (~100). Is there any way to speed things up?
The conversion to 'NormalIRangesList' takes several minutes for each single pattern ...
Thanks a lot!
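As a cross-check on the width-based calculation mentioned above: dividing the width of a merged run of hits by the pattern width is exact for homopolymer patterns such as "AAA", but it can overcount for patterns that overlap themselves in other ways (for example "AABAA" with hits starting at 1, 5, 9 and 13 gives a merged width of 17 and 17 %/% 5 = 3, although only 2 hits are disjoint). A greedy left-to-right count over the hit start positions avoids this. The sketch below is base R only; the function name and the way the start positions are obtained are assumptions, not part of Biostrings.

```r
# Hypothetical helper: greedy count of disjoint hits for ONE subject,
# given the start positions of all (possibly overlapping) hits.
count_disjoint_starts <- function(starts, pattern_width) {
  starts <- sort(starts)
  n <- 0L
  next_free <- -Inf               # first position not covered by an accepted hit
  for (s in starts) {
    if (s >= next_free) {         # hit does not overlap the last accepted one
      n <- n + 1L
      next_free <- s + pattern_width
    }
  }
  n
}

count_disjoint_starts(c(1, 2, 3, 10), 3)   # overlapping "AAA" hits
count_disjoint_starts(c(1, 5, 9, 13), 5)   # overlapping "AABAA" hits
```

The explicit loop is written for clarity rather than speed, so with about 800,000 subjects it would need to be vectorized or restricted to the subjects that actually contain overlapping hits.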
Hi Stefanie,
Coercion from MIndex to NormalIRangesList was indeed very inefficient. I fixed this in Biostrings 2.38.4. Now it's much faster. Biostrings 2.38.4 should become available via
biocLite()
in about 36 hours. Let me know if you still run into problems with this.
Cheers,
H.