matchPattern vmatchPattern vectorised

0

Entering edit mode

Guest User ★ 13k

@guest-user-4897

Last seen 11.4 years ago

Hi I am trying to write a package that will make a few shortcuts for my lazy coworkers. So I wrote a few bits of code that will find their primers in amongst a fastq of multiplexed reads (e.g 10-20). Next I thought I would save them the trouble of copy pasting Primers, Chromosome, and Start into a shell script, by instead autogenerating the script - We have the excellent BSgenome and Mmusculus9 packages installed so this seems a good starting point: So for the first primer this works well: > system.time(vmatchPattern("CCAGCACTGTATAGCCGATC", Mmusculus)) user system elapsed 45.853 2.702 50.273 This is fine for a single primer but it seems from the docs (and testing) that if I want to lookup 15 primers it will take 15 passes through the genome and 15x as long. About the same time it would take them to just copy them from their lab-books. I guess they could have a coffee...still... My first question: Is there another function or package on BioC that I have missed that might help me with this? Or low level functions I should look at to build a vectorised search (exact match) through Mmusculus? And second I guess is a feature suggestion: Why not allow matchPattern to pass once through the genome comparing a set(char vector , DNAStringSet etc) to the subject? This seems to require little extra computational load (I think). And given the difficulty of using BLAST within R might be very useful extension. thx Stephen -- output of sessionInfo(): R version 3.0.2 (2013-09-25) Platform: x86_64-apple-darwin10.8.0 (64-bit) locale: [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8 attached base packages: [1] parallel stats graphics grDevices utils datasets methods base other attached packages: [1] BSgenome.Mmusculus.UCSC.mm9_1.3.19 BSgenome_1.30.0 Biostrings_2.30.1 [4] GenomicRanges_1.14.4 XVector_0.2.0 IRanges_1.20.6 [7] BiocGenerics_0.8.0 data.table_1.8.10 dplyr_0.1 [10] hflights_0.1 Rcpp_0.10.6 loaded via a namespace (and not attached): [1] assertthat_0.1 devtools_1.4.1 digest_0.6.4 evaluate_0.5.1 formatR_0.10 httr_0.2 knitr_1.5 [8] memoise_0.1 RCurl_1.95-4.1 stats4_3.0.2 stringr_0.6.2 tools_3.0.2 whisker_0.3-2 -- Sent via the guest posting facility at bioconductor.org.

BSgenome BSgenome BSgenome BSgenome • 2.1k views

ADD COMMENT • link updated 12.1 years ago by Steve Lianoglou ★ 13k • written 12.1 years ago by Guest User ★ 13k

0

Entering edit mode

Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Last seen 11 weeks ago

United States

Hi, On Sat, Dec 14, 2013 at 6:00 AM, Stephen [guest] <guest at="" bioconductor.org=""> wrote: > > Hi > I am trying to write a package that will make a few shortcuts for my lazy coworkers. So I wrote a few bits of code that will find their primers in amongst a fastq of multiplexed reads (e.g 10-20). > > Next I thought I would save them the trouble of copy pasting Primers, Chromosome, and Start into a shell script, by instead autogenerating the script - We have the excellent BSgenome and Mmusculus9 packages installed so this seems a good starting point: > > So for the first primer this works well: >> system.time(vmatchPattern("CCAGCACTGTATAGCCGATC", Mmusculus)) > user system elapsed > 45.853 2.702 50.273 > > This is fine for a single primer but it seems from the docs (and testing) that if I want to lookup 15 primers it will take 15 passes through the genome and 15x as long. About the same time it would take them to just copy them from their lab-books. I guess they could have a coffee...still... > > My first question: Is there another function or package on BioC that I have missed that might help me with this? Or low level functions I should look at to build a vectorised search (exact match) through Mmusculus? There are packages which wrap aligners that you might consider using: * gmapR http://bioconductor.org/packages/release/bioc/html/gmapR.html * Rsubread: http://bioconductor.org/packages/release/bioc/html/Rsubread.html HTH, -steve -- Steve Lianoglou Computational Biologist Genentech

ADD COMMENT • link 12.1 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

thx Steve I hadn't heard of gmapR (and GSNAP) I will check it now. The Rsubread package is a possible as I have it installed and planned to use it later in the pipeline so there is little extra overhead -- in my case. However a quick look at the documentation and code (of both) suggests that they are creating indexed genomes, which is of course appropriate for multimillion read aligners. Yet I still think there is a gap for a simple tool (such a vmatchPattern) that matches a small exact vector or StringSet of patterns - rather than just a singleton in one pass. It's a pretty common mol bio lab task (what with all the new multiplexing techs). Stephen ---------------------- On Sat, Dec 14, 2013 at 6:00 AM, Stephen [guest] <guest at="" bioconductor.org=""> wrote: > > Hi > I am trying to write a package that will make a few shortcuts for my lazy coworkers. So I wrote a few bits of code that will find their primers in amongst a fastq of multiplexed reads (e.g 10-20). > > Next I thought I would save them the trouble of copy pasting Primers, Chromosome, and Start into a shell script, by instead autogenerating the script - We have the excellent BSgenome and Mmusculus9 packages installed so this seems a good starting point: > > So for the first primer this works well: >> system.time(vmatchPattern("CCAGCACTGTATAGCCGATC", Mmusculus)) > user system elapsed > 45.853 2.702 50.273 > > This is fine for a single primer but it seems from the docs (and testing) that if I want to lookup 15 primers it will take 15 passes through the genome and 15x as long. About the same time it would take them to just copy them from their lab-books. I guess they could have a coffee...still... > > My first question: Is there another function or package on BioC that I have missed that might help me with this? Or low level functions I should look at to build a vectorised search (exact match) through Mmusculus? There are packages which wrap aligners that you might consider using: * gmapR http://bioconductor.org/packages/release/bioc/html/gmapR.html * Rsubread: http://bioconductor.org/packages/release/bioc/html/Rsubread.html HTH, -steve -- Steve Lianoglou Computational Biologist Genentech

ADD REPLY • link 12.1 years ago Stephen Henderson ★ 1.0k

0

Entering edit mode

Hi Stephen, On 12/14/2013 08:05 AM, Henderson, Stephen wrote: > thx Steve > > I hadn't heard of gmapR (and GSNAP) I will check it now. The Rsubread package is a possible as I have it installed and planned to use it later in the pipeline so there is little extra overhead -- in my case. > > However a quick look at the documentation and code (of both) suggests that they are creating indexed genomes, which is of course appropriate for multimillion read aligners. > > Yet I still think there is a gap for a simple tool (such a vmatchPattern) that matches a small exact vector or StringSet of patterns - rather than just a singleton in one pass. It's a pretty common mol bio lab task (what with all the new multiplexing techs). matchPDict() does that. It will be fast (i.e. do only one pass) if all your patterns have the same length, in which case you should preprocess them with PDict() before passing them to matchPDict(). Otherwise (i.e. without preprocessing), matchPDict() will just call matchPattern() in a loop. See ?matchPDict in the Biostrings package. H. > > Stephen > > ---------------------- > > On Sat, Dec 14, 2013 at 6:00 AM, Stephen [guest] <guest at="" bioconductor.org=""> wrote: >> >> Hi >> I am trying to write a package that will make a few shortcuts for my lazy coworkers. So I wrote a few bits of code that will find their primers in amongst a fastq of multiplexed reads (e.g 10-20). >> >> Next I thought I would save them the trouble of copy pasting Primers, Chromosome, and Start into a shell script, by instead autogenerating the script - We have the excellent BSgenome and Mmusculus9 packages installed so this seems a good starting point: >> >> So for the first primer this works well: >>> system.time(vmatchPattern("CCAGCACTGTATAGCCGATC", Mmusculus)) >> user system elapsed >> 45.853 2.702 50.273 >> >> This is fine for a single primer but it seems from the docs (and testing) that if I want to lookup 15 primers it will take 15 passes through the genome and 15x as long. About the same time it would take them to just copy them from their lab-books. I guess they could have a coffee...still... >> >> My first question: Is there another function or package on BioC that I have missed that might help me with this? Or low level functions I should look at to build a vectorised search (exact match) through Mmusculus? > > There are packages which wrap aligners that you might consider using: > > * gmapR > http://bioconductor.org/packages/release/bioc/html/gmapR.html > > * Rsubread: > http://bioconductor.org/packages/release/bioc/html/Rsubread.html > > HTH, > -steve > > -- > Steve Lianoglou > Computational Biologist > Genentech > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 12.1 years ago Hervé Pagès 16k

0

Entering edit mode

matchPDict! Fantastic. That's exactly it. Sorry I missed that. Many thanks. Stephen On Dec 14, 2013 8:10 PM, =?ISO-8859-1?Q?Herv=E9_Pag=E8s?= <hpages@fhcrc.org> wrote: Hi Stephen, On 12/14/2013 08:05 AM, Henderson, Stephen wrote: > thx Steve > > I hadn't heard of gmapR (and GSNAP) I will check it now. The Rsubread package is a possible as I have it installed and planned to use it later in the pipeline so there is little extra overhead -- in my case. > > However a quick look at the documentation and code (of both) suggests that they are creating indexed genomes, which is of course appropriate for multimillion read aligners. > > Yet I still think there is a gap for a simple tool (such a vmatchPattern) that matches a small exact vector or StringSet of patterns - rather than just a singleton in one pass. It's a pretty common mol bio lab task (what with all the new multiplexing techs). matchPDict() does that. It will be fast (i.e. do only one pass) if all your patterns have the same length, in which case you should preprocess them with PDict() before passing them to matchPDict(). Otherwise (i.e. without preprocessing), matchPDict() will just call matchPattern() in a loop. See ?matchPDict in the Biostrings package. H. > > Stephen > > ---------------------- > > On Sat, Dec 14, 2013 at 6:00 AM, Stephen [guest] <guest@bioconductor.org> wrote: >> >> Hi >> I am trying to write a package that will make a few shortcuts for my lazy coworkers. So I wrote a few bits of code that will find their primers in amongst a fastq of multiplexed reads (e.g 10-20). >> >> Next I thought I would save them the trouble of copy pasting Primers, Chromosome, and Start into a shell script, by instead autogenerating the script - We have the excellent BSgenome and Mmusculus9 packages installed so this seems a good starting point: >> >> So for the first primer this works well: >>> system.time(vmatchPattern("CCAGCACTGTATAGCCGATC", Mmusculus)) >> user system elapsed >> 45.853 2.702 50.273 >> >> This is fine for a single primer but it seems from the docs (and testing) that if I want to lookup 15 primers it will take 15 passes through the genome and 15x as long. About the same time it would take them to just copy them from their lab-books. I guess they could have a coffee...still... >> >> My first question: Is there another function or package on BioC that I have missed that might help me with this? Or low level functions I should look at to build a vectorised search (exact match) through Mmusculus? > > There are packages which wrap aligners that you might consider using: > > * gmapR > http://bioconductor.org/packages/release/bioc/html/gmapR.html > > * Rsubread: > http://bioconductor.org/packages/release/bioc/html/Rsubread.html > > HTH, > -steve > > -- > Steve Lianoglou > Computational Biologist > Genentech > > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages@fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319 [[alternative HTML version deleted]]

ADD REPLY • link 12.1 years ago Stephen Henderson ★ 1.0k

0

Entering edit mode

Stephen Henderson ★ 1.0k

@stephen-henderson-71

Last seen 8.8 years ago

Sorry The second I sent my initial post I realised I should have said Biostrings package for the vmatchPattern function. Not BSgenome (which is the subject). Fortunately for me you seem to maintain both. So apologies for any confusion. Stephen ________________________________________ From: Stephen [guest] <guest@bioconductor.org> Sent: 14 December 2013 14:00 To: bioconductor at r-project.org; Henderson, Stephen Cc: BSgenome Maintainer Subject: matchPattern vmatchPattern vectorised Hi I am trying to write a package that will make a few shortcuts for my lazy coworkers. So I wrote a few bits of code that will find their primers in amongst a fastq of multiplexed reads (e.g 10-20). Next I thought I would save them the trouble of copy pasting Primers, Chromosome, and Start into a shell script, by instead autogenerating the script - We have the excellent BSgenome and Mmusculus9 packages installed so this seems a good starting point: So for the first primer this works well: > system.time(vmatchPattern("CCAGCACTGTATAGCCGATC", Mmusculus)) user system elapsed 45.853 2.702 50.273 This is fine for a single primer but it seems from the docs (and testing) that if I want to lookup 15 primers it will take 15 passes through the genome and 15x as long. About the same time it would take them to just copy them from their lab-books. I guess they could have a coffee...still... My first question: Is there another function or package on BioC that I have missed that might help me with this? Or low level functions I should look at to build a vectorised search (exact match) through Mmusculus? And second I guess is a feature suggestion: Why not allow matchPattern to pass once through the genome comparing a set(char vector , DNAStringSet etc) to the subject? This seems to require little extra computational load (I think). And given the difficulty of using BLAST within R might be very useful extension. thx Stephen -- output of sessionInfo(): R version 3.0.2 (2013-09-25) Platform: x86_64-apple-darwin10.8.0 (64-bit) locale: [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8 attached base packages: [1] parallel stats graphics grDevices utils datasets methods base other attached packages: [1] BSgenome.Mmusculus.UCSC.mm9_1.3.19 BSgenome_1.30.0 Biostrings_2.30.1 [4] GenomicRanges_1.14.4 XVector_0.2.0 IRanges_1.20.6 [7] BiocGenerics_0.8.0 data.table_1.8.10 dplyr_0.1 [10] hflights_0.1 Rcpp_0.10.6 loaded via a namespace (and not attached): [1] assertthat_0.1 devtools_1.4.1 digest_0.6.4 evaluate_0.5.1 formatR_0.10 httr_0.2 knitr_1.5 [8] memoise_0.1 RCurl_1.95-4.1 stats4_3.0.2 stringr_0.6.2 tools_3.0.2 whisker_0.3-2 -- Sent via the guest posting facility at bioconductor.org.

ADD COMMENT • link 12.1 years ago Stephen Henderson ★ 1.0k

Login before adding your answer.