Biostrings - vcountPattern optimization

0

Entering edit mode

Erik Wright ▴ 210

@erik-wright-4003

Last seen 9.6 years ago

Hello, Lately I have been working on counting sequence fragments in larger sets of sequences. I am searching for thousands of fragments of 30 to 130 bases in hundreds of thousands of sequences between 1200 and 1600 bases. Currently I am using the following method to count the number of "hits": #### start #### library(Biostrings) fragments <- DNAStringSet(c("ACTG","AAAA")) sequence_set <- DNAStringSet(c("TAGACATGAC","ACCAAATGAC")) for (i in 1:length(fragments)) { counts <- vcountPattern(fragments[[i]], sequence_set, max.mismatch=1) hits <- length(which(counts > 0)) print(hits) } #### end #### This method is taking a long time to complete, so I am wondering if I am doing this in the most efficient manner? Does anyone have a suggestion for how I can accomplish the same task more efficiently? Thanks!, Erik > sessionInfo() R version 2.11.0 (2010-04-22) x86_64-apple-darwin9.8.0 locale: [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] Biostrings_2.16.0 IRanges_1.6.0 loaded via a namespace (and not attached): [1] Biobase_2.8.0

• 2.6k views

ADD COMMENT • link updated 13.8 years ago by Patrick Aboyoun ★ 1.6k • written 13.8 years ago by Erik Wright ▴ 210

0

Entering edit mode

Patrick Aboyoun ★ 1.6k

@patrick-aboyoun-6734

Last seen 9.6 years ago

United States

Erik, Have you tried vcountPDict? It will use an Aho - Corasick matching algorithm (http://en.wikipedia.org/wiki/Aho?Corasick_string_matching_algorithm) that is pretty fast, albeit memory intensive. library(Biostrings) fragments<- DNAStringSet(c("ACTG","AAAA")) sequence_set<- DNAStringSet(c("TAGACATGAC","ACCAAATGAC")) pdict<- PDict(fragments) counts<- vcountPDict(pdict, sequence_set) > counts [,1] [,2] [1,] 0 0 [2,] 0 0 > sessionInfo() R version 2.12.0 Under development (unstable) (2010-07-18 r52554) Platform: i386-apple-darwin9.8.0/i386 (32-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] Biostrings_2.17.26 IRanges_1.7.13 loaded via a namespace (and not attached): [1] Biobase_2.9.0 tools_2.12.0 Patrick On 7/22/10 8:54 AM, Erik Wright wrote: > Hello, > > Lately I have been working on counting sequence fragments in larger sets of sequences. I am searching for thousands of fragments of 30 to 130 bases in hundreds of thousands of sequences between 1200 and 1600 bases. Currently I am using the following method to count the number of "hits": > > #### start #### > library(Biostrings) > fragments<- DNAStringSet(c("ACTG","AAAA")) > sequence_set<- DNAStringSet(c("TAGACATGAC","ACCAAATGAC")) > > for (i in 1:length(fragments)) { > counts<- vcountPattern(fragments[[i]], > sequence_set, > max.mismatch=1) > hits<- length(which(counts> 0)) > print(hits) > } > #### end #### > > This method is taking a long time to complete, so I am wondering if I am doing this in the most efficient manner? Does anyone have a suggestion for how I can accomplish the same task more efficiently? > > Thanks!, > Erik > > > > > >> sessionInfo() >> > R version 2.11.0 (2010-04-22) > x86_64-apple-darwin9.8.0 > > locale: > [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] Biostrings_2.16.0 IRanges_1.6.0 > > loaded via a namespace (and not attached): > [1] Biobase_2.8.0 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >

ADD COMMENT • link 13.8 years ago Patrick Aboyoun ★ 1.6k

0

Entering edit mode

Hi Patrick, Thanks, this looks promising. Three possible complications are: (1) The fragments are not all the same width. Is this possible with Pdict? (2) I allow a variable number of mismatches based on each individual fragment's width. (3) The fragments sometimes include ambiguity letters (IUPAC extended letters). A more accurate example would be: #### start #### fragments <- DNAStringSet(c("ACS","NCCAGAA")) # no indels sequence_set <- DNAStringSet(c("ATAGCATACKACCA","GATTACGTACCADADATTACA") # variable widths for (i in 1:length(fragments)) { counts <- vcountPattern(fragments[[i]], sequence_set, max.mismatch=floor(length(fragments[[i]])/5)) # variable mis-matches hits <- length(which(counts > 0)) print(hits) } #### end #### Do think it is possible to make this work Pdict for a speed improvement? Thanks again!, Erik On Jul 22, 2010, at 12:11 PM, Patrick Aboyoun wrote: > Erik, > Have you tried vcountPDict? It will use an Aho - Corasick matching algorithm (http://en.wikipedia.org/wiki/Aho?Corasick_string_matching_algorithm) that is pretty fast, albeit memory intensive. > > library(Biostrings) > fragments<- DNAStringSet(c("ACTG","AAAA")) > sequence_set<- DNAStringSet(c("TAGACATGAC","ACCAAATGAC")) > pdict<- PDict(fragments) > counts<- vcountPDict(pdict, sequence_set) > >> counts > [,1] [,2] > [1,] 0 0 > [2,] 0 0 > >> sessionInfo() > R version 2.12.0 Under development (unstable) (2010-07-18 r52554) > Platform: i386-apple-darwin9.8.0/i386 (32-bit) > > locale: > [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] Biostrings_2.17.26 IRanges_1.7.13 > > loaded via a namespace (and not attached): > [1] Biobase_2.9.0 tools_2.12.0 > > > > > Patrick > > > On 7/22/10 8:54 AM, Erik Wright wrote: >> Hello, >> >> Lately I have been working on counting sequence fragments in larger sets of sequences. I am searching for thousands of fragments of 30 to 130 bases in hundreds of thousands of sequences between 1200 and 1600 bases. Currently I am using the following method to count the number of "hits": >> >> #### start #### >> library(Biostrings) >> fragments<- DNAStringSet(c("ACTG","AAAA")) >> sequence_set<- DNAStringSet(c("TAGACATGAC","ACCAAATGAC")) >> >> for (i in 1:length(fragments)) { >> counts<- vcountPattern(fragments[[i]], >> sequence_set, >> max.mismatch=1) >> hits<- length(which(counts> 0)) >> print(hits) >> } >> #### end #### >> >> This method is taking a long time to complete, so I am wondering if I am doing this in the most efficient manner? Does anyone have a suggestion for how I can accomplish the same task more efficiently? >> >> Thanks!, >> Erik >> >> >> >> >> >>> sessionInfo() >>> >> R version 2.11.0 (2010-04-22) >> x86_64-apple-darwin9.8.0 >> >> locale: >> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] Biostrings_2.16.0 IRanges_1.6.0 >> >> loaded via a namespace (and not attached): >> [1] Biobase_2.8.0 >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >> >

ADD REPLY • link 13.8 years ago Erik Wright ▴ 210

0

Entering edit mode

Hi Erik, On 07/22/2010 10:32 AM, Erik Wright wrote: > Hi Patrick, > > Thanks, this looks promising. Three possible complications are: > (1) The fragments are not all the same width. Is this possible with Pdict? Yes, but given requirement (2), you need another solution. > (2) I allow a variable number of mismatches based on each individual fragment's width. So given (1) and (2), you could group your fragments by equal length, make a PDict object for each group, and use a single number of mismatches for that group (seems like this number only depends on the length of the fragment). > (3) The fragments sometimes include ambiguity letters (IUPAC extended letters). Unfortunately ambiguities are supported only in the subject at the moment. But you could still treat them separately with vcountPattern() in a loop. > > A more accurate example would be: > > #### start #### > fragments<- DNAStringSet(c("ACS","NCCAGAA")) # no indels > sequence_set<- DNAStringSet(c("ATAGCATACKACCA","GATTACGTACCADADATTACA") # variable widths > for (i in 1:length(fragments)) { > counts<- vcountPattern(fragments[[i]], > sequence_set, > max.mismatch=floor(length(fragments[[i]])/5)) # variable mis-matches > hits<- length(which(counts> 0)) > print(hits) > } > #### end #### > > Do think it is possible to make this work Pdict for a speed improvement? With max.mismatch being a fifth of the fragment length that means it will be between 6 (for 30bp fragments) and 26 (for 130bp fragments). Unfortunately, that's way too many mismatches PDict()/vcountPDict() can handle. Cheers, H. > > Thanks again!, > Erik > > > > On Jul 22, 2010, at 12:11 PM, Patrick Aboyoun wrote: > >> Erik, >> Have you tried vcountPDict? It will use an Aho - Corasick matching algorithm (http://en.wikipedia.org/wiki/Aho?Corasick_string_matching_algorithm) that is pretty fast, albeit memory intensive. >> >> library(Biostrings) >> fragments<- DNAStringSet(c("ACTG","AAAA")) >> sequence_set<- DNAStringSet(c("TAGACATGAC","ACCAAATGAC")) >> pdict<- PDict(fragments) >> counts<- vcountPDict(pdict, sequence_set) >> >>> counts >> [,1] [,2] >> [1,] 0 0 >> [2,] 0 0 >> >>> sessionInfo() >> R version 2.12.0 Under development (unstable) (2010-07-18 r52554) >> Platform: i386-apple-darwin9.8.0/i386 (32-bit) >> >> locale: >> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] Biostrings_2.17.26 IRanges_1.7.13 >> >> loaded via a namespace (and not attached): >> [1] Biobase_2.9.0 tools_2.12.0 >> >> >> >> >> Patrick >> >> >> On 7/22/10 8:54 AM, Erik Wright wrote: >>> Hello, >>> >>> Lately I have been working on counting sequence fragments in larger sets of sequences. I am searching for thousands of fragments of 30 to 130 bases in hundreds of thousands of sequences between 1200 and 1600 bases. Currently I am using the following method to count the number of "hits": >>> >>> #### start #### >>> library(Biostrings) >>> fragments<- DNAStringSet(c("ACTG","AAAA")) >>> sequence_set<- DNAStringSet(c("TAGACATGAC","ACCAAATGAC")) >>> >>> for (i in 1:length(fragments)) { >>> counts<- vcountPattern(fragments[[i]], >>> sequence_set, >>> max.mismatch=1) >>> hits<- length(which(counts> 0)) >>> print(hits) >>> } >>> #### end #### >>> >>> This method is taking a long time to complete, so I am wondering if I am doing this in the most efficient manner? Does anyone have a suggestion for how I can accomplish the same task more efficiently? >>> >>> Thanks!, >>> Erik >>> >>> >>> >>> >>> >>>> sessionInfo() >>>> >>> R version 2.11.0 (2010-04-22) >>> x86_64-apple-darwin9.8.0 >>> >>> locale: >>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 >>> >>> attached base packages: >>> [1] stats graphics grDevices utils datasets methods base >>> >>> other attached packages: >>> [1] Biostrings_2.16.0 IRanges_1.6.0 >>> >>> loaded via a namespace (and not attached): >>> [1] Biobase_2.8.0 >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at stat.math.ethz.ch >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 13.8 years ago Hervé Pagès 16k

0

Entering edit mode

Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Last seen 14 months ago

United States

Hi, On Thu, Jul 22, 2010 at 11:54 AM, Erik Wright <eswright at="" wisc.edu=""> wrote: > Hello, > > Lately I have been working on counting sequence fragments in larger sets of sequences. ?I am searching for thousands of fragments of 30 to 130 bases in hundreds of thousands of sequences between 1200 and 1600 bases. ?Currently I am using the following method to count the number of "hits": Would using bowtie as an intermediary be an option? For instance, you could consider: (i) making a bowtie-index out of your 1200-1600 bp "references" (ii) aligning your 30-130bp fragments agains it and output to SAM format (give each read a unique id so you can hunt for it later) (iii) convert SAM -> indexed BAM (iv) process bam file w/ Rsamtools -- perhaps you could simply do a `table()` on the sequence IDs of each alignment if all you want is a count -- of course now that the sequences are aligned, the data is in "good shape" to do other types of analyses as well (whatever it is that you're doing). > #### start #### > library(Biostrings) > fragments <- DNAStringSet(c("ACTG","AAAA")) > sequence_set <- DNAStringSet(c("TAGACATGAC","ACCAAATGAC")) > > for (i in 1:length(fragments)) { > ? ? ? ?counts <- vcountPattern(fragments[[i]], > ? ? ? ? ? ? ? ?sequence_set, > ? ? ? ? ? ? ? ?max.mismatch=1) > ? ? ? ?hits <- length(which(counts > 0)) > ? ? ? ?print(hits) > } > #### end #### > > This method is taking a long time to complete, so I am wondering if I am doing this in the most efficient manner? ?Does anyone have a suggestion for how I can accomplish the same task more efficiently? I don't really have any suggestions to make the above R code run faster ... sorry. -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact

ADD COMMENT • link 13.8 years ago Steve Lianoglou ★ 13k

Login before adding your answer.