Search
Question: Thread safety of BSgenome getSeq()
0
gravatar for Jeff Johnston
4.5 years ago by
United States
Jeff Johnston90 wrote:
Hi, I have encountered some issues using getSeq() on a BSgenome object inside a function parallelized with mclapply(). When calling getSeq() from multiple threads simultaneously, at least one will hang indefinitely using 100% CPU: #------------------ library(GenomicRanges) library(BSgenome.Dmelanogaster.UCSC.dm3) gr <- GRanges(ranges=IRanges(start=sample(seqlengths(Dmelanogaster)["chr2L"] - 20, 10000), width=20), seqnames="chr2L", strand="+") gr.list <- lapply(1:6, function(i) gr ) seqs.list <- mclapply(gr.list, function(gr) { message("getSeq() started") s <- getSeq(Dmelanogaster, gr) # does not reliably return if mc.cores > 1 message("getSeq() returned") s }, mc.cores=2) #------------------ If I instead load the BSgenome package inside the parallelized function everything is fine: #------------------ library(GenomicRanges) library(BSgenome.Dmelanogaster.UCSC.dm3) gr <- GRanges(ranges=IRanges(start=sample(seqlengths(Dmelanogaster)["chr2L"] - 20, 10000), width=20), seqnames="chr2L", strand="+") detach(name="package:BSgenome.Dmelanogaster.UCSC.dm3", unload=TRUE) gr.list <- lapply(1:6, function(i) gr ) seqs.list <- mclapply(gr.list, function(gr) { library(BSgenome.Dmelanogaster.UCSC.dm3) message("getSeq() started") s <- getSeq(Dmelanogaster, gr) # always works message("getSeq() returned") s }, mc.cores=2) #------------------ I can reproduce this issue on both Mac and Linux (both 64-bit). Is this just a limitation of BSgenome? Is there a better workaround than making sure the package is not loaded before the call to mclapply()? Thanks, Jeff Johnston Zeitlinger Lab Stowers Institute for Medical Research #------------------ > sessionInfo() R version 3.1.0 (2014-04-10) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods base other attached packages: [1] BSgenome.Dmelanogaster.UCSC.dm3_1.3.99 BSgenome_1.32.0 Biostrings_2.32.0 XVector_0.4.0 [5] GenomicRanges_1.16.3 GenomeInfoDb_1.0.2 IRanges_1.22.8 BiocGenerics_0.10.0 [9] setwidth_1.0-3 loaded via a namespace (and not attached): [1] bitops_1.0-6 Rsamtools_1.16.0 stats4_3.1.0 zlibbioc_1.10.0 #------------------
ADD COMMENTlink modified 4.4 years ago by Hervé Pagès ♦♦ 13k • written 4.5 years ago by Jeff Johnston90
0
gravatar for Hervé Pagès
4.4 years ago by
Hervé Pagès ♦♦ 13k
United States
Hervé Pagès ♦♦ 13k wrote:
Hi Jeff, Sorry for the delay in answering. Thanks for reporting this. Martin Morgan was able to reproduce and found out that the problem is originating in the razf C code from Samtools (which is included in the Rsamtools package). The razf code provides the low level functionality for writing/reading RAZipp'ed compressed FASTA files, which is the format used in BioC 2.14 for storing the genomic sequences contained in the BSgenome data packages. Other issues have been reported with these RAZip based BSgenome packages, and, more importantly, it seems that the original Samtools developers have moved to other projects and are not going to maintain the razf code anymore. So this new issue you found convinced me that we should stop using the RAZip format in the BSgenome data packages. I'm currently in the process of rolling them back to the old on-disk format (i.e. 1 serialized DNAString object per chromosome). Expect them to show up in the BioC package repos in the next couple of days. They'll be at version 1.3.1000 (source packages will show up today, binary packages will follow). Also note that in devel (BioC 3.0), the BSgenome data packages use the 2bit format (from UCSC), which is superior in any aspect to the previous on-disk formats (i.e. in terms of speed, memory usage, and size on disk). Unfortunately I cannot use this format for the BSgenome data packages in release because that relies on functionalities in the BSgenome *software* package that are only available in BioC devel. Let me know if you have any questions or concerns about this. Cheers, H. On 06/06/2014 01:15 PM, Johnston, Jeffrey wrote: > Hi, > > I have encountered some issues using getSeq() on a BSgenome object inside a function parallelized with mclapply(). When calling getSeq() from multiple threads simultaneously, at least one will hang indefinitely using 100% CPU: > > #------------------ > library(GenomicRanges) > library(BSgenome.Dmelanogaster.UCSC.dm3) > gr <- GRanges(ranges=IRanges(start=sample(seqlengths(Dmelanogaster)["chr2L"] - 20, 10000), width=20), seqnames="chr2L", strand="+") > gr.list <- lapply(1:6, function(i) gr ) > > seqs.list <- mclapply(gr.list, function(gr) { > message("getSeq() started") > s <- getSeq(Dmelanogaster, gr) # does not reliably return if mc.cores > 1 > message("getSeq() returned") > s > }, mc.cores=2) > #------------------ > > If I instead load the BSgenome package inside the parallelized function everything is fine: > > #------------------ > library(GenomicRanges) > library(BSgenome.Dmelanogaster.UCSC.dm3) > gr <- GRanges(ranges=IRanges(start=sample(seqlengths(Dmelanogaster)["chr2L"] - 20, 10000), width=20), seqnames="chr2L", strand="+") > detach(name="package:BSgenome.Dmelanogaster.UCSC.dm3", unload=TRUE) > gr.list <- lapply(1:6, function(i) gr ) > > seqs.list <- mclapply(gr.list, function(gr) { > library(BSgenome.Dmelanogaster.UCSC.dm3) > message("getSeq() started") > s <- getSeq(Dmelanogaster, gr) # always works > message("getSeq() returned") > s > }, mc.cores=2) > #------------------ > > I can reproduce this issue on both Mac and Linux (both 64-bit). > > Is this just a limitation of BSgenome? Is there a better workaround than making sure the package is not loaded before the call to mclapply()? > > Thanks, > Jeff Johnston > Zeitlinger Lab > Stowers Institute for Medical Research > > #------------------ >> sessionInfo() > R version 3.1.0 (2014-04-10) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods base > > other attached packages: > [1] BSgenome.Dmelanogaster.UCSC.dm3_1.3.99 BSgenome_1.32.0 Biostrings_2.32.0 XVector_0.4.0 > [5] GenomicRanges_1.16.3 GenomeInfoDb_1.0.2 IRanges_1.22.8 BiocGenerics_0.10.0 > [9] setwidth_1.0-3 > > loaded via a namespace (and not attached): > [1] bitops_1.0-6 Rsamtools_1.16.0 stats4_3.1.0 zlibbioc_1.10.0 > #------------------ > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD COMMENTlink written 4.4 years ago by Hervé Pagès ♦♦ 13k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 365 users visited in the last hour