advice on Biostrings
3
0
Entering edit mode
@rafael-a-irizarry-205
Last seen 10.3 years ago
hi im using biostrings to count base content as well as pair of bases content. im using the following sniped of code: ###pmseq is a vector of character strings (not of the same nchar). tmp <- sapply(pmseq,function(x){ y = DNAString(x) c(alphabetFrequency(y)[2:5], ##count A,T,G,C length(matchDNAPattern("GC",y))+length(matchDNAPattern("CG",y))) ##count GC or CG }) it is painfully slow. strsplit and grep were much faster for the first part (counting bases) but the using grep for the second part was not straight forward. any suggestions? -r
Biostrings Biostrings • 1.8k views
ADD COMMENT
0
Entering edit mode
@herve-pages-1542
Last seen 14 hours ago
Seattle, WA, United States
Hi Rafael, Comparing the speed of > grep("GC", y, fixed=TRUE) with the speed of: > matchDNAPattern("GC", DNAString(y)) is a little unfair since the former will stop searching after it founds the first occurence of "GC" (which is likely to happen very early since nchar("GC") is only 2, probably in the first 10 or 20 letters of a random TGCA string), while the latter will process the entire string in order to count the total number of occurences of "GC". So yes, grep is faster and is all what you need as long as you are not interested in counting the number of matches (or retrieving their offsets). On my system: > library(Biostrings) > y <- scan(file="bigrandomTGCA.txt", what="") Read 1 item > nchar(y) [1] 10000000 > system.time(grep("GC", y, fixed=TRUE)) [1] 0.01 0.00 0.00 0.00 0.00 > dy <- DNAString(y) > system.time(length(matchDNAPattern("GC", dy))) [1] 0.07 0.01 0.08 0.00 0.00 Now if you need to count the number of matches, using length(strsplit()) might be faster than matchDNAPattern() on small strings (nchar < 5000) but it will definetly be __much__ slower on big strings: > nchar(y2) [1] 18314 > system.time(length(strsplit(y2, "A", fixed=TRUE)[[1]])) [1] 0.08 0.00 0.09 0.00 0.00 > dy2 = DNAString(y2) > system.time(length(matchDNAPattern("A", dy2))) [1] 0.02 0.00 0.01 0.00 0.00 Don't even try strsplit() on a 10 millions character string: it will take forever and you won't be able to interrupt with CTRL C... Regards, H. Rafael A Irizarry wrote: > hi im using biostrings to count base content as well as pair of bases > content. im using the following sniped of code: > > > ###pmseq is a vector of character strings (not of the same nchar). > tmp <- sapply(pmseq,function(x){ > y = DNAString(x) > c(alphabetFrequency(y)[2:5], ##count A,T,G,C > length(matchDNAPattern("GC",y))+length(matchDNAPattern("CG",y))) > ##count GC or CG > }) > > it is painfully slow. strsplit and grep were much faster for the first > part (counting bases) but the using grep for the second part was not > straight forward. > > any suggestions? > > -r > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor -- ------------------------ Hervé Pagès E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD COMMENT
0
Entering edit mode
@martin-morgan-1513
Last seen 4 months ago
United States
Actually, this function in Biostrings uses one of those fiendishly clever algorithms coded in C, and as a consequence is *very* fast: on my puny windows laptop, searching a random sequence of 10 million bp for "GC" returns more than 1/2 million matches in about a 10th of a second: > seq <- paste( dna[ runif( 10000000, min=1, max=5 )], collapse="" ) > system.time( res <- length(matchDNAPattern( "GC", seq ))); res [1] 0.13 0.00 0.12 NA NA [1] 626393 Maybe your speed issues are related to R, or to memory management, or 'user error' ;) ?. Here are suggestions: * If the match pattern "GC" is a character, it is converted to a DNAString by matchDNAPattern. Short-circuit this step by GC <- DNAString("GC") outside the sapply, and then matchDNAPattern( GC, y ) This seems to actually make quite a bit of difference: > seqs <- lapply( 1:1000, function(i) paste( dna[ runif( 100, min=1, max=5 )], collapse="" )) > system.time( res <- sapply( seqs, function(s) length(matchDNAPattern( "GC", s )))) [1] 27.52 0.01 27.61 NA NA > GC <- DNAString("GC") > system.time( res <- sapply( seqs, function(s) length(matchDNAPattern( GC, s )))) [1] 16.53 0.00 16.56 NA NA > * Make sure your function is actually doing what you want. My mistake in exploring this was to write > seq <- paste( dna[ runif( 10000000, min=1, max=5 )]) (i.e., without the 'collapse' argument). this (tried to) create a vector with 10 million single characters. The sapply would then have looped over these, rather than the single string I intended! * Do the calculations in parallel > library(snow) > cl <- makeCluster(4, "MPI") > clusterEvalQ(cl, library(Biostrings)) > system.time(res <- parSapply( cl, seqs, function(s) length(matchDNAPattern( GC, s )))) [1] 0.01 0.00 4.56 0.00 0.00 ! * Perhaps your R is spending a lot of time swapping virtual memory to disk. Try garbage collection gc() before performing the analysis. If this doesn't halp, and it seems like memory really is an issue, then perhaps reading the strings in a chunk at a time would be necessary. You might alos try separating the sapply into three separate calls, one for each function. I doubt this is really a problem, unless you're dealing with *very* large collections of sequence. * I found that analyzing some number of bp in a few strings is much faster than analyzing the same total number of bp in many strings, i.e., the bottleneck is actually in sapply. There are apparently tricks to enhancing sapply speed, though my exploration in the present context didn't provide any dramatic results. Hope these suggestions help... Martin Wolfgang Huber <huber at="" ebi.ac.uk=""> writes: > Rafael A Irizarry wrote: >> hi im using biostrings to count base content as well as pair of bases >> content. im using the following sniped of code: >> > > Hi Rafa, > > to count symbols in character vectors, matchprobes:basecontent is fast: > > library(matchprobes) > v = c("AAACT", "GGGTT", "ggAtT") > bc = basecontent(v) > print.default(bc) > bc[,"C"]+bc[,"G"] > > and if there is interest I'd be happy amend the C code to also count > pairs of bases (or you could, it is not terribly complicated). > > Cheers > Wolfgang > >> >> ###pmseq is a vector of character strings (not of the same nchar). >> tmp <- sapply(pmseq,function(x){ >> y = DNAString(x) >> c(alphabetFrequency(y)[2:5], ##count A,T,G,C >> length(matchDNAPattern("GC",y))+length(matchDNAPattern("CG",y))) >> ##count GC or CG >> }) >> >> it is painfully slow. strsplit and grep were much faster for the first >> part (counting bases) but the using grep for the second part was not >> straight forward. >> >> any suggestions? > > > ------------------------------------- > Wolfgang Huber > European Bioinformatics Institute > European Molecular Biology Laboratory > Cambridge CB10 1SD > England > Phone: +44 1223 494642 > Fax: +44 1223 494486 > Http: www.ebi.ac.uk/huber > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor
ADD COMMENT
0
Entering edit mode
thanks everybody. this made it much faster: tmpseq <- DNAString(pmseq) GC <- DNAString("GC") CG <- DNAString("CG") tmp <- sapply(1:length(tmpseq),function(i){ y=tmpseq[i] c(alphabetFrequency(y)[2:5], length(matchDNAPattern(GC,y)), length(matchDNAPattern(CG,y))) }) Martin Morgan wrote: >Actually, this function in Biostrings uses one of those fiendishly >clever algorithms coded in C, and as a consequence is *very* fast: on >my puny windows laptop, searching a random sequence of 10 million bp >for "GC" returns more than 1/2 million matches in about a 10th of a >second: > > > >>seq <- paste( dna[ runif( 10000000, min=1, max=5 )], collapse="" ) >>system.time( res <- length(matchDNAPattern( "GC", seq ))); res >> >> >[1] 0.13 0.00 0.12 NA NA >[1] 626393 > >Maybe your speed issues are related to R, or to memory management, or >'user error' ;) ?. Here are suggestions: > >* If the match pattern "GC" is a character, it is converted to a >DNAString by matchDNAPattern. Short-circuit this step by > >GC <- DNAString("GC") > >outside the sapply, and then > >matchDNAPattern( GC, y ) > >This seems to actually make quite a bit of difference: > > > >>seqs <- lapply( 1:1000, function(i) paste( dna[ runif( 100, min=1, max=5 )], collapse="" )) >>system.time( res <- sapply( seqs, function(s) length(matchDNAPattern( "GC", s )))) >> >> >[1] 27.52 0.01 27.61 NA NA > > >>GC <- DNAString("GC") >>system.time( res <- sapply( seqs, function(s) length(matchDNAPattern( GC, s )))) >> >> >[1] 16.53 0.00 16.56 NA NA > > > >* Make sure your function is actually doing what you want. My mistake >in exploring this was to write > > > >>seq <- paste( dna[ runif( 10000000, min=1, max=5 )]) >> >> > >(i.e., without the 'collapse' argument). this (tried to) create a >vector with 10 million single characters. The sapply would then have >looped over these, rather than the single string I intended! > >* Do the calculations in parallel > > > >>library(snow) >>cl <- makeCluster(4, "MPI") >>clusterEvalQ(cl, library(Biostrings)) >>system.time(res <- parSapply( cl, seqs, function(s) length(matchDNAPattern( GC, s )))) >> >> >[1] 0.01 0.00 4.56 0.00 0.00 > >! > >* Perhaps your R is spending a lot of time swapping virtual memory to >disk. Try garbage collection > >gc() > >before performing the analysis. If this doesn't halp, and it seems >like memory really is an issue, then perhaps reading the strings in a >chunk at a time would be necessary. You might alos try separating the >sapply into three separate calls, one for each function. I doubt this >is really a problem, unless you're dealing with *very* large >collections of sequence. > >* I found that analyzing some number of bp in a few strings is much >faster than analyzing the same total number of bp in many strings, >i.e., the bottleneck is actually in sapply. There are apparently >tricks to enhancing sapply speed, though my exploration in the present >context didn't provide any dramatic results. > >Hope these suggestions help... > >Martin > >Wolfgang Huber <huber at="" ebi.ac.uk=""> writes: > > > >>Rafael A Irizarry wrote: >> >> >>>hi im using biostrings to count base content as well as pair of bases >>>content. im using the following sniped of code: >>> >>> >>> >>Hi Rafa, >> >>to count symbols in character vectors, matchprobes:basecontent is fast: >> >>library(matchprobes) >>v = c("AAACT", "GGGTT", "ggAtT") >>bc = basecontent(v) >>print.default(bc) >>bc[,"C"]+bc[,"G"] >> >>and if there is interest I'd be happy amend the C code to also count >>pairs of bases (or you could, it is not terribly complicated). >> >> Cheers >> Wolfgang >> >> >> >>>###pmseq is a vector of character strings (not of the same nchar). >>>tmp <- sapply(pmseq,function(x){ >>> y = DNAString(x) >>> c(alphabetFrequency(y)[2:5], ##count A,T,G,C >>> length(matchDNAPattern("GC",y))+length(matchDNAPattern("CG",y))) >>>##count GC or CG >>>}) >>> >>>it is painfully slow. strsplit and grep were much faster for the first >>>part (counting bases) but the using grep for the second part was not >>>straight forward. >>> >>>any suggestions? >>> >>> >>------------------------------------- >>Wolfgang Huber >>European Bioinformatics Institute >>European Molecular Biology Laboratory >>Cambridge CB10 1SD >>England >>Phone: +44 1223 494642 >>Fax: +44 1223 494486 >>Http: www.ebi.ac.uk/huber >> >>_______________________________________________ >>Bioconductor mailing list >>Bioconductor at stat.math.ethz.ch >>https://stat.ethz.ch/mailman/listinfo/bioconductor >> >> > > >
ADD REPLY
0
Entering edit mode
@wolfgang-huber-3550
Last seen 3 months ago
EMBL European Molecular Biology Laborat…
Rafael A Irizarry wrote: > hi im using biostrings to count base content as well as pair of bases > content. im using the following sniped of code: > Hi Rafa, to count symbols in character vectors, matchprobes:basecontent is fast: library(matchprobes) v = c("AAACT", "GGGTT", "ggAtT") bc = basecontent(v) print.default(bc) bc[,"C"]+bc[,"G"] and if there is interest I'd be happy amend the C code to also count pairs of bases (or you could, it is not terribly complicated). Cheers Wolfgang > > ###pmseq is a vector of character strings (not of the same nchar). > tmp <- sapply(pmseq,function(x){ > y = DNAString(x) > c(alphabetFrequency(y)[2:5], ##count A,T,G,C > length(matchDNAPattern("GC",y))+length(matchDNAPattern("CG",y))) > ##count GC or CG > }) > > it is painfully slow. strsplit and grep were much faster for the first > part (counting bases) but the using grep for the second part was not > straight forward. > > any suggestions? ------------------------------------- Wolfgang Huber European Bioinformatics Institute European Molecular Biology Laboratory Cambridge CB10 1SD England Phone: +44 1223 494642 Fax: +44 1223 494486 Http: www.ebi.ac.uk/huber
ADD COMMENT

Login before adding your answer.

Traffic: 774 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6