what is the best way to get scores for matches from matchPWM() ?
1
0
Entering edit mode
Lucas Carey ▴ 40
@lucas-carey-3898
Last seen 5.3 years ago
Hi All, I'm wondering what is the best way to get the score for every match from matchPWM() in Biostrings Right now, to score all matches to pwm in genome I do this: #Find PWM hits for fwd & reverse complement of PWM for all chromosomes in genome mmf <- sapply(1:Nchr, function(chr){matchPWM(pwm,genome[[chr]],min.score=cutoff) } ) mmr <- sapply(1:Nchr, function(chr){matchPWM(reverseComplement(pwm),genome[[chr]],min.score= cutoff) } ) mmm <- c(mmf,mmr) #Extract the sequences. RevComp where necessary. Sequences <- c( rapply(mmf,as.character,how='unlist'), sapply(rapply(mmr,as.character,how='unlist'),function(x){c2s(rev(comp( s2c(x))))}) ) #convert to DNAStringSet for in order to score. This is quite slow lcl_set <- DNAStringSet(as.character(Sequences)) Scores <- sapply(lcl_set,PWMscoreStartingAt,pwm=pwm) This is incredibly inefficient. What is the best way to do this? thanks -Lucas
• 2.3k views
ADD COMMENT
0
Entering edit mode
Patrick Aboyoun ★ 1.6k
@patrick-aboyoun-6734
Last seen 9.6 years ago
United States
Lucas, Your e-mail was very timely since I just had my hands in the matchPWM code in BioC 2.6. First if you are using BioC 2.5/R-2.10, try modifying your first sapply loop to become an lapply loop to see if there is any performance gain: mmf <- lapply(1:Nchr, function(chr) { subject <- genome[[chr]] hits <- matchPWM(pwm, subject, min.score = cutoff) scores <- PWMscoreStartingAt(pwm, subject, starting.at = start(hits)) list(hits = as(hits, "IRanges"), scores = scores) }) If you are more adventurous and are using BioC 2.6/R-devel, I have just updated the matchPWM method for BSgenome objects in the BSgenome package to include the PWM score in the output so the code is much simpler. This updated BSgenome package in BioC 2.6 will be available from bioconductor.org on Thursday or available from svn now. > suppressMessages(library(BSgenome)) > library(BSgenome.Celegans.UCSC.ce2) > data(HNF4alpha) > pwm <- PWM(HNF4alpha) > matchPWM(pwm, Celegans) RangedData with 154706 rows and 3 value columns across 7 spaces space ranges | strand score string <character> <iranges> | <rle> <numeric> <dnastringset> 1 chrI [ 3596, 3608] | + 0.8017528 GGGGCAAAATTAG 2 chrI [ 3880, 3892] | + 0.8291304 GGATTAGAGTTCT 3 chrI [ 5158, 5170] | + 0.8208024 ATTTCAAAGTTTT 4 chrI [ 6128, 6140] | + 0.8540097 GGTGCAAACGTCT 5 chrI [10039, 10051] | + 0.8078732 GGTCTAAAATTCT 6 chrI [10201, 10213] | + 0.8549414 AGACGAGAGGTCA 7 chrI [11388, 11400] | + 0.8137701 GGCACATAGGTCA 8 chrI [12708, 12720] | + 0.8624401 GGGGTAAAGTCAA 9 chrI [13190, 13202] | + 0.8361537 AGTGTAAAGATCT 10 chrI [15609, 15621] | + 0.8061359 GGAGAAAAGGCAA ... <154696 more rows> > sessionInfo() R version 2.11.0 Under development (unstable) (2010-01-18 r50995) i386-apple-darwin9.8.0 locale: [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] BSgenome.Celegans.UCSC.ce2_1.3.16 BSgenome_1.15.4 [3] Biostrings_2.15.18 IRanges_1.5.29 loaded via a namespace (and not attached): [1] Biobase_2.7.3 tools_2.11.0 Lucas Carey wrote: > Hi All, > I'm wondering what is the best way to get the score for every match > from matchPWM() in Biostrings > > Right now, to score all matches to pwm in genome I do this: > > #Find PWM hits for fwd & reverse complement of PWM for all chromosomes in genome > mmf <- sapply(1:Nchr, > function(chr){matchPWM(pwm,genome[[chr]],min.score=cutoff) } ) > mmr <- sapply(1:Nchr, > function(chr){matchPWM(reverseComplement(pwm),genome[[chr]],min.scor e=cutoff) > } ) > mmm <- c(mmf,mmr) > > #Extract the sequences. RevComp where necessary. > Sequences <- c( rapply(mmf,as.character,how='unlist'), > sapply(rapply(mmr,as.character,how='unlist'),function(x){c2s(rev(com p(s2c(x))))}) > ) > > #convert to DNAStringSet for in order to score. This is quite slow > lcl_set <- DNAStringSet(as.character(Sequences)) > Scores <- sapply(lcl_set,PWMscoreStartingAt,pwm=pwm) > > This is incredibly inefficient. What is the best way to do this? > > thanks > > -Lucas > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >
ADD COMMENT
0
Entering edit mode
Hi Patrick, Thanks. Your method, with starting.at = start(hits) brings the function (in a small test case) from 6sec down to 1sec. -Lucas On Jan 20, 2010, at 1:56 PM, Patrick Aboyoun wrote: > Lucas, > Your e-mail was very timely since I just had my hands in the matchPWM code in BioC 2.6. > > First if you are using BioC 2.5/R-2.10, try modifying your first sapply loop to become an lapply loop to see if there is any performance gain: > > mmf <- lapply(1:Nchr, function(chr) { > subject <- genome[[chr]] > hits <- matchPWM(pwm, subject, min.score = cutoff) > scores <- PWMscoreStartingAt(pwm, subject, starting.at = start(hits)) > list(hits = as(hits, "IRanges"), scores = scores) > }) > > > If you are more adventurous and are using BioC 2.6/R-devel, I have just updated the matchPWM method for BSgenome objects in the BSgenome package to include the PWM score in the output so the code is much simpler. This updated BSgenome package in BioC 2.6 will be available from bioconductor.org on Thursday or available from svn now. > > > suppressMessages(library(BSgenome)) > > library(BSgenome.Celegans.UCSC.ce2) > > data(HNF4alpha) > > pwm <- PWM(HNF4alpha) > > matchPWM(pwm, Celegans) > RangedData with 154706 rows and 3 value columns across 7 spaces > space ranges | strand score string > <character> <iranges> | <rle> <numeric> <dnastringset> > 1 chrI [ 3596, 3608] | + 0.8017528 GGGGCAAAATTAG > 2 chrI [ 3880, 3892] | + 0.8291304 GGATTAGAGTTCT > 3 chrI [ 5158, 5170] | + 0.8208024 ATTTCAAAGTTTT > 4 chrI [ 6128, 6140] | + 0.8540097 GGTGCAAACGTCT > 5 chrI [10039, 10051] | + 0.8078732 GGTCTAAAATTCT > 6 chrI [10201, 10213] | + 0.8549414 AGACGAGAGGTCA > 7 chrI [11388, 11400] | + 0.8137701 GGCACATAGGTCA > 8 chrI [12708, 12720] | + 0.8624401 GGGGTAAAGTCAA > 9 chrI [13190, 13202] | + 0.8361537 AGTGTAAAGATCT > 10 chrI [15609, 15621] | + 0.8061359 GGAGAAAAGGCAA > ... > <154696 more rows> > > sessionInfo() > R version 2.11.0 Under development (unstable) (2010-01-18 r50995) > i386-apple-darwin9.8.0 > > locale: > [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > other attached packages: > [1] BSgenome.Celegans.UCSC.ce2_1.3.16 BSgenome_1.15.4 [3] Biostrings_2.15.18 IRanges_1.5.29 > loaded via a namespace (and not attached): > [1] Biobase_2.7.3 tools_2.11.0 > > > > > > Lucas Carey wrote: >> Hi All, >> I'm wondering what is the best way to get the score for every match >> from matchPWM() in Biostrings >> >> Right now, to score all matches to pwm in genome I do this: >> >> #Find PWM hits for fwd & reverse complement of PWM for all chromosomes in genome >> mmf <- sapply(1:Nchr, >> function(chr){matchPWM(pwm,genome[[chr]],min.score=cutoff) } ) >> mmr <- sapply(1:Nchr, >> function(chr){matchPWM(reverseComplement(pwm),genome[[chr]],min.sco re=cutoff) >> } ) >> mmm <- c(mmf,mmr) >> >> #Extract the sequences. RevComp where necessary. >> Sequences <- c( rapply(mmf,as.character,how='unlist'), >> sapply(rapply(mmr,as.character,how='unlist'),function(x){c2s(rev(co mp(s2c(x))))}) >> ) >> >> #convert to DNAStringSet for in order to score. This is quite slow >> lcl_set <- DNAStringSet(as.character(Sequences)) >> Scores <- sapply(lcl_set,PWMscoreStartingAt,pwm=pwm) >> >> This is incredibly inefficient. What is the best way to do this? >> >> thanks >> >> -Lucas >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >> >
ADD REPLY

Login before adding your answer.

Traffic: 748 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6