Biostrings - vcountPattern optimization
2
0
Entering edit mode
Erik Wright ▴ 210
@erik-wright-4003
Last seen 10.4 years ago
Hello, Lately I have been working on counting sequence fragments in larger sets of sequences. I am searching for thousands of fragments of 30 to 130 bases in hundreds of thousands of sequences between 1200 and 1600 bases. Currently I am using the following method to count the number of "hits": #### start #### library(Biostrings) fragments <- DNAStringSet(c("ACTG","AAAA")) sequence_set <- DNAStringSet(c("TAGACATGAC","ACCAAATGAC")) for (i in 1:length(fragments)) { counts <- vcountPattern(fragments[[i]], sequence_set, max.mismatch=1) hits <- length(which(counts > 0)) print(hits) } #### end #### This method is taking a long time to complete, so I am wondering if I am doing this in the most efficient manner? Does anyone have a suggestion for how I can accomplish the same task more efficiently? Thanks!, Erik > sessionInfo() R version 2.11.0 (2010-04-22) x86_64-apple-darwin9.8.0 locale: [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] Biostrings_2.16.0 IRanges_1.6.0 loaded via a namespace (and not attached): [1] Biobase_2.8.0
• 3.0k views
ADD COMMENT
0
Entering edit mode
Patrick Aboyoun ★ 1.6k
@patrick-aboyoun-6734
Last seen 10.3 years ago
United States
Erik, Have you tried vcountPDict? It will use an Aho - Corasick matching algorithm (http://en.wikipedia.org/wiki/Aho?Corasick_string_matching_algorithm) that is pretty fast, albeit memory intensive. library(Biostrings) fragments<- DNAStringSet(c("ACTG","AAAA")) sequence_set<- DNAStringSet(c("TAGACATGAC","ACCAAATGAC")) pdict<- PDict(fragments) counts<- vcountPDict(pdict, sequence_set) > counts [,1] [,2] [1,] 0 0 [2,] 0 0 > sessionInfo() R version 2.12.0 Under development (unstable) (2010-07-18 r52554) Platform: i386-apple-darwin9.8.0/i386 (32-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] Biostrings_2.17.26 IRanges_1.7.13 loaded via a namespace (and not attached): [1] Biobase_2.9.0 tools_2.12.0 Patrick On 7/22/10 8:54 AM, Erik Wright wrote: > Hello, > > Lately I have been working on counting sequence fragments in larger sets of sequences. I am searching for thousands of fragments of 30 to 130 bases in hundreds of thousands of sequences between 1200 and 1600 bases. Currently I am using the following method to count the number of "hits": > > #### start #### > library(Biostrings) > fragments<- DNAStringSet(c("ACTG","AAAA")) > sequence_set<- DNAStringSet(c("TAGACATGAC","ACCAAATGAC")) > > for (i in 1:length(fragments)) { > counts<- vcountPattern(fragments[[i]], > sequence_set, > max.mismatch=1) > hits<- length(which(counts> 0)) > print(hits) > } > #### end #### > > This method is taking a long time to complete, so I am wondering if I am doing this in the most efficient manner? Does anyone have a suggestion for how I can accomplish the same task more efficiently? > > Thanks!, > Erik > > > > > >> sessionInfo() >> > R version 2.11.0 (2010-04-22) > x86_64-apple-darwin9.8.0 > > locale: > [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] Biostrings_2.16.0 IRanges_1.6.0 > > loaded via a namespace (and not attached): > [1] Biobase_2.8.0 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >
ADD COMMENT
0
Entering edit mode
Hi Patrick, Thanks, this looks promising. Three possible complications are: (1) The fragments are not all the same width. Is this possible with Pdict? (2) I allow a variable number of mismatches based on each individual fragment's width. (3) The fragments sometimes include ambiguity letters (IUPAC extended letters). A more accurate example would be: #### start #### fragments <- DNAStringSet(c("ACS","NCCAGAA")) # no indels sequence_set <- DNAStringSet(c("ATAGCATACKACCA","GATTACGTACCADADATTACA") # variable widths for (i in 1:length(fragments)) { counts <- vcountPattern(fragments[[i]], sequence_set, max.mismatch=floor(length(fragments[[i]])/5)) # variable mis-matches hits <- length(which(counts > 0)) print(hits) } #### end #### Do think it is possible to make this work Pdict for a speed improvement? Thanks again!, Erik On Jul 22, 2010, at 12:11 PM, Patrick Aboyoun wrote: > Erik, > Have you tried vcountPDict? It will use an Aho - Corasick matching algorithm (http://en.wikipedia.org/wiki/Aho?Corasick_string_matching_algorithm) that is pretty fast, albeit memory intensive. > > library(Biostrings) > fragments<- DNAStringSet(c("ACTG","AAAA")) > sequence_set<- DNAStringSet(c("TAGACATGAC","ACCAAATGAC")) > pdict<- PDict(fragments) > counts<- vcountPDict(pdict, sequence_set) > >> counts > [,1] [,2] > [1,] 0 0 > [2,] 0 0 > >> sessionInfo() > R version 2.12.0 Under development (unstable) (2010-07-18 r52554) > Platform: i386-apple-darwin9.8.0/i386 (32-bit) > > locale: > [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] Biostrings_2.17.26 IRanges_1.7.13 > > loaded via a namespace (and not attached): > [1] Biobase_2.9.0 tools_2.12.0 > > > > > Patrick > > > On 7/22/10 8:54 AM, Erik Wright wrote: >> Hello, >> >> Lately I have been working on counting sequence fragments in larger sets of sequences. I am searching for thousands of fragments of 30 to 130 bases in hundreds of thousands of sequences between 1200 and 1600 bases. Currently I am using the following method to count the number of "hits": >> >> #### start #### >> library(Biostrings) >> fragments<- DNAStringSet(c("ACTG","AAAA")) >> sequence_set<- DNAStringSet(c("TAGACATGAC","ACCAAATGAC")) >> >> for (i in 1:length(fragments)) { >> counts<- vcountPattern(fragments[[i]], >> sequence_set, >> max.mismatch=1) >> hits<- length(which(counts> 0)) >> print(hits) >> } >> #### end #### >> >> This method is taking a long time to complete, so I am wondering if I am doing this in the most efficient manner? Does anyone have a suggestion for how I can accomplish the same task more efficiently? >> >> Thanks!, >> Erik >> >> >> >> >> >>> sessionInfo() >>> >> R version 2.11.0 (2010-04-22) >> x86_64-apple-darwin9.8.0 >> >> locale: >> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] Biostrings_2.16.0 IRanges_1.6.0 >> >> loaded via a namespace (and not attached): >> [1] Biobase_2.8.0 >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >> >
ADD REPLY
0
Entering edit mode
Hi Erik, On 07/22/2010 10:32 AM, Erik Wright wrote: > Hi Patrick, > > Thanks, this looks promising. Three possible complications are: > (1) The fragments are not all the same width. Is this possible with Pdict? Yes, but given requirement (2), you need another solution. > (2) I allow a variable number of mismatches based on each individual fragment's width. So given (1) and (2), you could group your fragments by equal length, make a PDict object for each group, and use a single number of mismatches for that group (seems like this number only depends on the length of the fragment). > (3) The fragments sometimes include ambiguity letters (IUPAC extended letters). Unfortunately ambiguities are supported only in the subject at the moment. But you could still treat them separately with vcountPattern() in a loop. > > A more accurate example would be: > > #### start #### > fragments<- DNAStringSet(c("ACS","NCCAGAA")) # no indels > sequence_set<- DNAStringSet(c("ATAGCATACKACCA","GATTACGTACCADADATTACA") # variable widths > for (i in 1:length(fragments)) { > counts<- vcountPattern(fragments[[i]], > sequence_set, > max.mismatch=floor(length(fragments[[i]])/5)) # variable mis-matches > hits<- length(which(counts> 0)) > print(hits) > } > #### end #### > > Do think it is possible to make this work Pdict for a speed improvement? With max.mismatch being a fifth of the fragment length that means it will be between 6 (for 30bp fragments) and 26 (for 130bp fragments). Unfortunately, that's way too many mismatches PDict()/vcountPDict() can handle. Cheers, H. > > Thanks again!, > Erik > > > > On Jul 22, 2010, at 12:11 PM, Patrick Aboyoun wrote: > >> Erik, >> Have you tried vcountPDict? It will use an Aho - Corasick matching algorithm (http://en.wikipedia.org/wiki/Aho?Corasick_string_matching_algorithm) that is pretty fast, albeit memory intensive. >> >> library(Biostrings) >> fragments<- DNAStringSet(c("ACTG","AAAA")) >> sequence_set<- DNAStringSet(c("TAGACATGAC","ACCAAATGAC")) >> pdict<- PDict(fragments) >> counts<- vcountPDict(pdict, sequence_set) >> >>> counts >> [,1] [,2] >> [1,] 0 0 >> [2,] 0 0 >> >>> sessionInfo() >> R version 2.12.0 Under development (unstable) (2010-07-18 r52554) >> Platform: i386-apple-darwin9.8.0/i386 (32-bit) >> >> locale: >> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] Biostrings_2.17.26 IRanges_1.7.13 >> >> loaded via a namespace (and not attached): >> [1] Biobase_2.9.0 tools_2.12.0 >> >> >> >> >> Patrick >> >> >> On 7/22/10 8:54 AM, Erik Wright wrote: >>> Hello, >>> >>> Lately I have been working on counting sequence fragments in larger sets of sequences. I am searching for thousands of fragments of 30 to 130 bases in hundreds of thousands of sequences between 1200 and 1600 bases. Currently I am using the following method to count the number of "hits": >>> >>> #### start #### >>> library(Biostrings) >>> fragments<- DNAStringSet(c("ACTG","AAAA")) >>> sequence_set<- DNAStringSet(c("TAGACATGAC","ACCAAATGAC")) >>> >>> for (i in 1:length(fragments)) { >>> counts<- vcountPattern(fragments[[i]], >>> sequence_set, >>> max.mismatch=1) >>> hits<- length(which(counts> 0)) >>> print(hits) >>> } >>> #### end #### >>> >>> This method is taking a long time to complete, so I am wondering if I am doing this in the most efficient manner? Does anyone have a suggestion for how I can accomplish the same task more efficiently? >>> >>> Thanks!, >>> Erik >>> >>> >>> >>> >>> >>>> sessionInfo() >>>> >>> R version 2.11.0 (2010-04-22) >>> x86_64-apple-darwin9.8.0 >>> >>> locale: >>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 >>> >>> attached base packages: >>> [1] stats graphics grDevices utils datasets methods base >>> >>> other attached packages: >>> [1] Biostrings_2.16.0 IRanges_1.6.0 >>> >>> loaded via a namespace (and not attached): >>> [1] Biobase_2.8.0 >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at stat.math.ethz.ch >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD REPLY
0
Entering edit mode
@steve-lianoglou-2771
Last seen 22 months ago
United States
Hi, On Thu, Jul 22, 2010 at 11:54 AM, Erik Wright <eswright at="" wisc.edu=""> wrote: > Hello, > > Lately I have been working on counting sequence fragments in larger sets of sequences. ?I am searching for thousands of fragments of 30 to 130 bases in hundreds of thousands of sequences between 1200 and 1600 bases. ?Currently I am using the following method to count the number of "hits": Would using bowtie as an intermediary be an option? For instance, you could consider: (i) making a bowtie-index out of your 1200-1600 bp "references" (ii) aligning your 30-130bp fragments agains it and output to SAM format (give each read a unique id so you can hunt for it later) (iii) convert SAM -> indexed BAM (iv) process bam file w/ Rsamtools -- perhaps you could simply do a `table()` on the sequence IDs of each alignment if all you want is a count -- of course now that the sequences are aligned, the data is in "good shape" to do other types of analyses as well (whatever it is that you're doing). > #### start #### > library(Biostrings) > fragments <- DNAStringSet(c("ACTG","AAAA")) > sequence_set <- DNAStringSet(c("TAGACATGAC","ACCAAATGAC")) > > for (i in 1:length(fragments)) { > ? ? ? ?counts <- vcountPattern(fragments[[i]], > ? ? ? ? ? ? ? ?sequence_set, > ? ? ? ? ? ? ? ?max.mismatch=1) > ? ? ? ?hits <- length(which(counts > 0)) > ? ? ? ?print(hits) > } > #### end #### > > This method is taking a long time to complete, so I am wondering if I am doing this in the most efficient manner? ?Does anyone have a suggestion for how I can accomplish the same task more efficiently? I don't really have any suggestions to make the above R code run faster ... sorry. -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact
ADD COMMENT

Login before adding your answer.

Traffic: 468 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6