singe_sequences.fa.gz file in BSgenome.Hsapiens.NCBI.GRCh38 is too big
@herve-pages-1542
Seattle, WA, United States
Hi Sean,

On 04/15/2014 11:30 PM, Sean Li [guest] wrote:
>
> singe_sequences.fa.gz file in BSgenome.Hsapiens.NCBI.GRCh38 is too big
> to load. Why can't you separate it into several files as
> BSgenome.Hsapiens.UCSC.hg19 does?

How are you trying to access the genome sequences in
BSgenome.Hsapiens.NCBI.GRCh38?

Note that the singe_sequences.fa.gz file is internal to the package and
you should avoid accessing it directly. The "normal" way to access the
genome sequences is via [[ or getSeq().
Use [[ to load a given chromosome:

    genome <- BSgenome.Hsapiens.NCBI.GRCh38
    genome[["1"]]

Use getSeq() to extract a set of regions (typically specified via
a GRanges object).

Loading the entire genome requires that R be able to allocate more
than 3 GB of RAM, which I don't think is possible on your platform
(32-bit Windows). That's just the size of the Human genome once in
memory (i.e. in a DNAStringSet object), and whatever format is used to
store it on disk (a single file or 1 file per chromosome) won't change
that.

Anyway, because of other issues with singe_sequences.fa.gz, today
BSgenome.Hsapiens.NCBI.GRCh38 will be updated with a new version that
uses one file per chromosome.

Cheers,
H.

>
> -- output of sessionInfo():
>
> R version 3.1.0 (2014-04-10)
> Platform: i386-w64-mingw32/i386 (32-bit)
>
> locale:
> [1] LC_COLLATE=Chinese_People's Republic of China.936
> [2] LC_CTYPE=Chinese_People's Republic of China.936
> [3] LC_MONETARY=Chinese_People's Republic of China.936
> [4] LC_NUMERIC=C
> [5] LC_TIME=Chinese_People's Republic of China.936
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> --
> Sent via the guest posting facility at bioconductor.org.
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>

--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
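To make the access pattern described in the reply above concrete, here is a minimal sketch in R. It assumes the BSgenome.Hsapiens.NCBI.GRCh38 data package is already installed (a large download) and that GenomicRanges is available. The coordinates in `regions` are hypothetical and chosen only for illustration. Note that R is case-sensitive, so the package object is spelled `BSgenome...`, not `Bsgenome...`.

```r
# Assumption: the data package is installed, e.g. with
# BiocManager::install("BSgenome.Hsapiens.NCBI.GRCh38").
library(BSgenome.Hsapiens.NCBI.GRCh38)
library(GenomicRanges)

genome <- BSgenome.Hsapiens.NCBI.GRCh38

# Load a single chromosome into memory instead of the whole genome.
# NCBI-style sequence names carry no "chr" prefix.
chr1 <- genome[["1"]]
length(chr1)

# Hypothetical regions of interest (illustrative coordinates only).
regions <- GRanges(
  seqnames = c("1", "2"),
  ranges   = IRanges(start = c(1000001, 2000001), width = 50),
  strand   = c("+", "-")
)

# getSeq() returns a DNAStringSet with one sequence per region;
# minus-strand regions are returned reverse-complemented.
seqs <- getSeq(genome, regions)
width(seqs)
```

Working chromosome by chromosome, or region by region with getSeq(), keeps the memory footprint well below what holding the whole genome at once would need, which matters on a 32-bit platform.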
Cancer BSgenome
@herve-pages-1542
Seattle, WA, United States
Hi again,

I got a little confused and didn't realize that I was answering such
an old post (from April), or that you are also the person who reported
the following issues on the bioc-devel list in April (the 2nd issue was
forwarded to the list by Michael):

  https://stat.ethz.ch/pipermail/bioc-devel/2014-April/005570.html
  https://stat.ethz.ch/pipermail/bioc-devel/2014-April/005591.html

I hope all will be fine now with the BSgenome packages update. Please
let me know if you still run into issues with the new packages (version
1.3.1000 or higher).

Thanks,
H.

On 06/18/2014 11:34 AM, Hervé Pagès wrote:
> Hi Sean,
>
> On 04/15/2014 11:30 PM, Sean Li [guest] wrote:
>>
>> singe_sequences.fa.gz file in BSgenome.Hsapiens.NCBI.GRCh38 is too big
>> to load. Why can't you separate it into several files as
>> BSgenome.Hsapiens.UCSC.hg19 does?
>
> How are you trying to access the genome sequences in
> BSgenome.Hsapiens.NCBI.GRCh38?
>
> Note that the singe_sequences.fa.gz file is internal to the package
> and you should avoid accessing it directly. The "normal" way to
> access the genome sequences is via [[ or getSeq().
> Use [[ to load a given chromosome:
>
>     genome <- BSgenome.Hsapiens.NCBI.GRCh38
>     genome[["1"]]
>
> Use getSeq() to extract a set of regions (typically specified via
> a GRanges object).
>
> Loading the entire genome requires that R be able to allocate more
> than 3 GB of RAM, which I don't think is possible on your platform
> (32-bit Windows). That's just the size of the Human genome once in
> memory (i.e. in a DNAStringSet object), and whatever format is used to
> store it on disk (a single file or 1 file per chromosome) won't change
> that.
>
> Anyway, because of other issues with singe_sequences.fa.gz, today
> BSgenome.Hsapiens.NCBI.GRCh38 will be updated with a new version that
> uses one file per chromosome.
>
> Cheers,
> H.
>
>>
>> -- output of sessionInfo():
>>
>> R version 3.1.0 (2014-04-10)
>> Platform: i386-w64-mingw32/i386 (32-bit)
>>
>> locale:
>> [1] LC_COLLATE=Chinese_People's Republic of China.936
>> [2] LC_CTYPE=Chinese_People's Republic of China.936
>> [3] LC_MONETARY=Chinese_People's Republic of China.936
>> [4] LC_NUMERIC=C
>> [5] LC_TIME=Chinese_People's Republic of China.936
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> --
>> Sent via the guest posting facility at bioconductor.org.
>>
>

--
Hervé Pagès