what is the best way to get scores for matches from matchPWM() ?

0

Entering edit mode

Lucas Carey ▴ 40

@lucas-carey-3898

Last seen 7.1 years ago

Hi All, I'm wondering what is the best way to get the score for every match from matchPWM() in Biostrings Right now, to score all matches to pwm in genome I do this: #Find PWM hits for fwd & reverse complement of PWM for all chromosomes in genome mmf <- sapply(1:Nchr, function(chr){matchPWM(pwm,genome[[chr]],min.score=cutoff) } ) mmr <- sapply(1:Nchr, function(chr){matchPWM(reverseComplement(pwm),genome[[chr]],min.score= cutoff) } ) mmm <- c(mmf,mmr) #Extract the sequences. RevComp where necessary. Sequences <- c( rapply(mmf,as.character,how='unlist'), sapply(rapply(mmr,as.character,how='unlist'),function(x){c2s(rev(comp( s2c(x))))}) ) #convert to DNAStringSet for in order to score. This is quite slow lcl_set <- DNAStringSet(as.character(Sequences)) Scores <- sapply(lcl_set,PWMscoreStartingAt,pwm=pwm) This is incredibly inefficient. What is the best way to do this? thanks -Lucas

• 2.9k views

ADD COMMENT • link updated 16.1 years ago by Patrick Aboyoun ★ 1.6k • written 16.1 years ago by Lucas Carey ▴ 40

0

Entering edit mode

Patrick Aboyoun ★ 1.6k

@patrick-aboyoun-6734

Last seen 11.4 years ago

United States

Lucas, Your e-mail was very timely since I just had my hands in the matchPWM code in BioC 2.6. First if you are using BioC 2.5/R-2.10, try modifying your first sapply loop to become an lapply loop to see if there is any performance gain: mmf <- lapply(1:Nchr, function(chr) { subject <- genome[[chr]] hits <- matchPWM(pwm, subject, min.score = cutoff) scores <- PWMscoreStartingAt(pwm, subject, starting.at = start(hits)) list(hits = as(hits, "IRanges"), scores = scores) }) If you are more adventurous and are using BioC 2.6/R-devel, I have just updated the matchPWM method for BSgenome objects in the BSgenome package to include the PWM score in the output so the code is much simpler. This updated BSgenome package in BioC 2.6 will be available from bioconductor.org on Thursday or available from svn now. > suppressMessages(library(BSgenome)) > library(BSgenome.Celegans.UCSC.ce2) > data(HNF4alpha) > pwm <- PWM(HNF4alpha) > matchPWM(pwm, Celegans) RangedData with 154706 rows and 3 value columns across 7 spaces space ranges | strand score string <character> <iranges> | <rle> <numeric> <dnastringset> 1 chrI [ 3596, 3608] | + 0.8017528 GGGGCAAAATTAG 2 chrI [ 3880, 3892] | + 0.8291304 GGATTAGAGTTCT 3 chrI [ 5158, 5170] | + 0.8208024 ATTTCAAAGTTTT 4 chrI [ 6128, 6140] | + 0.8540097 GGTGCAAACGTCT 5 chrI [10039, 10051] | + 0.8078732 GGTCTAAAATTCT 6 chrI [10201, 10213] | + 0.8549414 AGACGAGAGGTCA 7 chrI [11388, 11400] | + 0.8137701 GGCACATAGGTCA 8 chrI [12708, 12720] | + 0.8624401 GGGGTAAAGTCAA 9 chrI [13190, 13202] | + 0.8361537 AGTGTAAAGATCT 10 chrI [15609, 15621] | + 0.8061359 GGAGAAAAGGCAA ... <154696 more rows> > sessionInfo() R version 2.11.0 Under development (unstable) (2010-01-18 r50995) i386-apple-darwin9.8.0 locale: [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] BSgenome.Celegans.UCSC.ce2_1.3.16 BSgenome_1.15.4 [3] Biostrings_2.15.18 IRanges_1.5.29 loaded via a namespace (and not attached): [1] Biobase_2.7.3 tools_2.11.0 Lucas Carey wrote: > Hi All, > I'm wondering what is the best way to get the score for every match > from matchPWM() in Biostrings > > Right now, to score all matches to pwm in genome I do this: > > #Find PWM hits for fwd & reverse complement of PWM for all chromosomes in genome > mmf <- sapply(1:Nchr, > function(chr){matchPWM(pwm,genome[[chr]],min.score=cutoff) } ) > mmr <- sapply(1:Nchr, > function(chr){matchPWM(reverseComplement(pwm),genome[[chr]],min.scor e=cutoff) > } ) > mmm <- c(mmf,mmr) > > #Extract the sequences. RevComp where necessary. > Sequences <- c( rapply(mmf,as.character,how='unlist'), > sapply(rapply(mmr,as.character,how='unlist'),function(x){c2s(rev(com p(s2c(x))))}) > ) > > #convert to DNAStringSet for in order to score. This is quite slow > lcl_set <- DNAStringSet(as.character(Sequences)) > Scores <- sapply(lcl_set,PWMscoreStartingAt,pwm=pwm) > > This is incredibly inefficient. What is the best way to do this? > > thanks > > -Lucas > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >

ADD COMMENT • link 16.1 years ago Patrick Aboyoun ★ 1.6k

0

Entering edit mode

Hi Patrick, Thanks. Your method, with starting.at = start(hits) brings the function (in a small test case) from 6sec down to 1sec. -Lucas On Jan 20, 2010, at 1:56 PM, Patrick Aboyoun wrote: > Lucas, > Your e-mail was very timely since I just had my hands in the matchPWM code in BioC 2.6. > > First if you are using BioC 2.5/R-2.10, try modifying your first sapply loop to become an lapply loop to see if there is any performance gain: > > mmf <- lapply(1:Nchr, function(chr) { > subject <- genome[[chr]] > hits <- matchPWM(pwm, subject, min.score = cutoff) > scores <- PWMscoreStartingAt(pwm, subject, starting.at = start(hits)) > list(hits = as(hits, "IRanges"), scores = scores) > }) > > > If you are more adventurous and are using BioC 2.6/R-devel, I have just updated the matchPWM method for BSgenome objects in the BSgenome package to include the PWM score in the output so the code is much simpler. This updated BSgenome package in BioC 2.6 will be available from bioconductor.org on Thursday or available from svn now. > > > suppressMessages(library(BSgenome)) > > library(BSgenome.Celegans.UCSC.ce2) > > data(HNF4alpha) > > pwm <- PWM(HNF4alpha) > > matchPWM(pwm, Celegans) > RangedData with 154706 rows and 3 value columns across 7 spaces > space ranges | strand score string > <character> <iranges> | <rle> <numeric> <dnastringset> > 1 chrI [ 3596, 3608] | + 0.8017528 GGGGCAAAATTAG > 2 chrI [ 3880, 3892] | + 0.8291304 GGATTAGAGTTCT > 3 chrI [ 5158, 5170] | + 0.8208024 ATTTCAAAGTTTT > 4 chrI [ 6128, 6140] | + 0.8540097 GGTGCAAACGTCT > 5 chrI [10039, 10051] | + 0.8078732 GGTCTAAAATTCT > 6 chrI [10201, 10213] | + 0.8549414 AGACGAGAGGTCA > 7 chrI [11388, 11400] | + 0.8137701 GGCACATAGGTCA > 8 chrI [12708, 12720] | + 0.8624401 GGGGTAAAGTCAA > 9 chrI [13190, 13202] | + 0.8361537 AGTGTAAAGATCT > 10 chrI [15609, 15621] | + 0.8061359 GGAGAAAAGGCAA > ... > <154696 more rows> > > sessionInfo() > R version 2.11.0 Under development (unstable) (2010-01-18 r50995) > i386-apple-darwin9.8.0 > > locale: > [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > other attached packages: > [1] BSgenome.Celegans.UCSC.ce2_1.3.16 BSgenome_1.15.4 [3] Biostrings_2.15.18 IRanges_1.5.29 > loaded via a namespace (and not attached): > [1] Biobase_2.7.3 tools_2.11.0 > > > > > > Lucas Carey wrote: >> Hi All, >> I'm wondering what is the best way to get the score for every match >> from matchPWM() in Biostrings >> >> Right now, to score all matches to pwm in genome I do this: >> >> #Find PWM hits for fwd & reverse complement of PWM for all chromosomes in genome >> mmf <- sapply(1:Nchr, >> function(chr){matchPWM(pwm,genome[[chr]],min.score=cutoff) } ) >> mmr <- sapply(1:Nchr, >> function(chr){matchPWM(reverseComplement(pwm),genome[[chr]],min.sco re=cutoff) >> } ) >> mmm <- c(mmf,mmr) >> >> #Extract the sequences. RevComp where necessary. >> Sequences <- c( rapply(mmf,as.character,how='unlist'), >> sapply(rapply(mmr,as.character,how='unlist'),function(x){c2s(rev(co mp(s2c(x))))}) >> ) >> >> #convert to DNAStringSet for in order to score. This is quite slow >> lcl_set <- DNAStringSet(as.character(Sequences)) >> Scores <- sapply(lcl_set,PWMscoreStartingAt,pwm=pwm) >> >> This is incredibly inefficient. What is the best way to do this? >> >> thanks >> >> -Lucas >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >> >

ADD REPLY • link 16.1 years ago Lucas Carey ▴ 40

Login before adding your answer.