Thread safety of BSgenome getSeq()
1
0
Entering edit mode
@jeff-johnston-6497
Last seen 4.7 years ago
United States
Thanks for looking into this Herv? and Martin, and for the quick fix. It is working as expected now. I did have to install the updated packages from source as the binaries do not seem to be updating for Mac or Linux. Thanks, Jeff On Jun 18, 2014, at 1:14 PM, Hervé Pagès <hpages at="" fhcrc.org=""> wrote: > Hi Jeff, > > Sorry for the delay in answering. > > Thanks for reporting this. Martin Morgan was able to reproduce and > found out that the problem is originating in the razf C code from > Samtools (which is included in the Rsamtools package). The razf code > provides the low level functionality for writing/reading RAZipp'ed > compressed FASTA files, which is the format used in BioC 2.14 for > storing the genomic sequences contained in the BSgenome data packages. > > Other issues have been reported with these RAZip based BSgenome > packages, and, more importantly, it seems that the original Samtools > developers have moved to other projects and are not going to maintain > the razf code anymore. > > So this new issue you found convinced me that we should stop using > the RAZip format in the BSgenome data packages. I'm currently in the > process of rolling them back to the old on-disk format (i.e. 1 > serialized DNAString object per chromosome). Expect them to show up in > the BioC package repos in the next couple of days. They'll be at > version 1.3.1000 (source packages will show up today, binary packages > will follow). > > Also note that in devel (BioC 3.0), the BSgenome data packages use the > 2bit format (from UCSC), which is superior in any aspect to the > previous on-disk formats (i.e. in terms of speed, memory usage, and > size on disk). Unfortunately I cannot use this format for the BSgenome > data packages in release because that relies on functionalities in > the BSgenome *software* package that are only available in BioC devel. > > Let me know if you have any questions or concerns about this. > > Cheers, > H. > > On 06/06/2014 01:15 PM, Johnston, Jeffrey wrote: >> Hi, >> >> I have encountered some issues using getSeq() on a BSgenome object inside a function parallelized with mclapply(). When calling getSeq() from multiple threads simultaneously, at least one will hang indefinitely using 100% CPU: >> >> #------------------ >> library(GenomicRanges) >> library(BSgenome.Dmelanogaster.UCSC.dm3) >> gr <- GRanges(ranges=IRanges(start=sample(seqlengths(Dmelanogaster)["chr2L"] - 20, 10000), width=20), seqnames="chr2L", strand="+") >> gr.list <- lapply(1:6, function(i) gr ) >> >> seqs.list <- mclapply(gr.list, function(gr) { >> message("getSeq() started") >> s <- getSeq(Dmelanogaster, gr) # does not reliably return if mc.cores > 1 >> message("getSeq() returned") >> s >> }, mc.cores=2) >> #------------------ >> >> If I instead load the BSgenome package inside the parallelized function everything is fine: >> >> #------------------ >> library(GenomicRanges) >> library(BSgenome.Dmelanogaster.UCSC.dm3) >> gr <- GRanges(ranges=IRanges(start=sample(seqlengths(Dmelanogaster)["chr2L"] - 20, 10000), width=20), seqnames="chr2L", strand="+") >> detach(name="package:BSgenome.Dmelanogaster.UCSC.dm3", unload=TRUE) >> gr.list <- lapply(1:6, function(i) gr ) >> >> seqs.list <- mclapply(gr.list, function(gr) { >> library(BSgenome.Dmelanogaster.UCSC.dm3) >> message("getSeq() started") >> s <- getSeq(Dmelanogaster, gr) # always works >> message("getSeq() returned") >> s >> }, mc.cores=2) >> #------------------ >> >> I can reproduce this issue on both Mac and Linux (both 64-bit). >> >> Is this just a limitation of BSgenome? Is there a better workaround than making sure the package is not loaded before the call to mclapply()? >> >> Thanks, >> Jeff Johnston >> Zeitlinger Lab >> Stowers Institute for Medical Research >> >> #------------------ >>> sessionInfo() >> R version 3.1.0 (2014-04-10) >> Platform: x86_64-unknown-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] parallel stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] BSgenome.Dmelanogaster.UCSC.dm3_1.3.99 BSgenome_1.32.0 Biostrings_2.32.0 XVector_0.4.0 >> [5] GenomicRanges_1.16.3 GenomeInfoDb_1.0.2 IRanges_1.22.8 BiocGenerics_0.10.0 >> [9] setwidth_1.0-3 >> >> loaded via a namespace (and not attached): >> [1] bitops_1.0-6 Rsamtools_1.16.0 stats4_3.1.0 zlibbioc_1.10.0 >> #------------------ >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319
Cancer BSgenome PROcess BSgenome Rsamtools Cancer BSgenome PROcess BSgenome Rsamtools • 967 views
ADD COMMENT
0
Entering edit mode
@herve-pages-1542
Last seen 4 days ago
Seattle, WA, United States
Hi Jeff, On 07/01/2014 10:40 AM, Johnston, Jeffrey wrote: > Thanks for looking into this Herv? and Martin, and for the quick fix. It is working as expected now. I did have to install the updated packages from source as the binaries do not seem to be updating for Mac or Linux. Glad it works for you now. Yes we still need to generate the binaries for Windows and Mac. Will do soon. Thanks for the reminder! Cheers, H. > > Thanks, > Jeff > > On Jun 18, 2014, at 1:14 PM, Hervé Pagès <hpages at="" fhcrc.org=""> wrote: > >> Hi Jeff, >> >> Sorry for the delay in answering. >> >> Thanks for reporting this. Martin Morgan was able to reproduce and >> found out that the problem is originating in the razf C code from >> Samtools (which is included in the Rsamtools package). The razf code >> provides the low level functionality for writing/reading RAZipp'ed >> compressed FASTA files, which is the format used in BioC 2.14 for >> storing the genomic sequences contained in the BSgenome data packages. >> >> Other issues have been reported with these RAZip based BSgenome >> packages, and, more importantly, it seems that the original Samtools >> developers have moved to other projects and are not going to maintain >> the razf code anymore. >> >> So this new issue you found convinced me that we should stop using >> the RAZip format in the BSgenome data packages. I'm currently in the >> process of rolling them back to the old on-disk format (i.e. 1 >> serialized DNAString object per chromosome). Expect them to show up in >> the BioC package repos in the next couple of days. They'll be at >> version 1.3.1000 (source packages will show up today, binary packages >> will follow). >> >> Also note that in devel (BioC 3.0), the BSgenome data packages use the >> 2bit format (from UCSC), which is superior in any aspect to the >> previous on-disk formats (i.e. in terms of speed, memory usage, and >> size on disk). Unfortunately I cannot use this format for the BSgenome >> data packages in release because that relies on functionalities in >> the BSgenome *software* package that are only available in BioC devel. >> >> Let me know if you have any questions or concerns about this. >> >> Cheers, >> H. >> >> On 06/06/2014 01:15 PM, Johnston, Jeffrey wrote: >>> Hi, >>> >>> I have encountered some issues using getSeq() on a BSgenome object inside a function parallelized with mclapply(). When calling getSeq() from multiple threads simultaneously, at least one will hang indefinitely using 100% CPU: >>> >>> #------------------ >>> library(GenomicRanges) >>> library(BSgenome.Dmelanogaster.UCSC.dm3) >>> gr <- GRanges(ranges=IRanges(start=sample(seqlengths(Dmelanogaster)["chr2L"] - 20, 10000), width=20), seqnames="chr2L", strand="+") >>> gr.list <- lapply(1:6, function(i) gr ) >>> >>> seqs.list <- mclapply(gr.list, function(gr) { >>> message("getSeq() started") >>> s <- getSeq(Dmelanogaster, gr) # does not reliably return if mc.cores > 1 >>> message("getSeq() returned") >>> s >>> }, mc.cores=2) >>> #------------------ >>> >>> If I instead load the BSgenome package inside the parallelized function everything is fine: >>> >>> #------------------ >>> library(GenomicRanges) >>> library(BSgenome.Dmelanogaster.UCSC.dm3) >>> gr <- GRanges(ranges=IRanges(start=sample(seqlengths(Dmelanogaster)["chr2L"] - 20, 10000), width=20), seqnames="chr2L", strand="+") >>> detach(name="package:BSgenome.Dmelanogaster.UCSC.dm3", unload=TRUE) >>> gr.list <- lapply(1:6, function(i) gr ) >>> >>> seqs.list <- mclapply(gr.list, function(gr) { >>> library(BSgenome.Dmelanogaster.UCSC.dm3) >>> message("getSeq() started") >>> s <- getSeq(Dmelanogaster, gr) # always works >>> message("getSeq() returned") >>> s >>> }, mc.cores=2) >>> #------------------ >>> >>> I can reproduce this issue on both Mac and Linux (both 64-bit). >>> >>> Is this just a limitation of BSgenome? Is there a better workaround than making sure the package is not loaded before the call to mclapply()? >>> >>> Thanks, >>> Jeff Johnston >>> Zeitlinger Lab >>> Stowers Institute for Medical Research >>> >>> #------------------ >>>> sessionInfo() >>> R version 3.1.0 (2014-04-10) >>> Platform: x86_64-unknown-linux-gnu (64-bit) >>> >>> locale: >>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>> >>> attached base packages: >>> [1] parallel stats graphics grDevices utils datasets methods base >>> >>> other attached packages: >>> [1] BSgenome.Dmelanogaster.UCSC.dm3_1.3.99 BSgenome_1.32.0 Biostrings_2.32.0 XVector_0.4.0 >>> [5] GenomicRanges_1.16.3 GenomeInfoDb_1.0.2 IRanges_1.22.8 BiocGenerics_0.10.0 >>> [9] setwidth_1.0-3 >>> >>> loaded via a namespace (and not attached): >>> [1] bitops_1.0-6 Rsamtools_1.16.0 stats4_3.1.0 zlibbioc_1.10.0 >>> #------------------ >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M1-B514 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages at fhcrc.org >> Phone: (206) 667-5791 >> Fax: (206) 667-1319 > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD COMMENT

Login before adding your answer.

Traffic: 421 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6