write.XStringView/write.XStringSet highly inefficient

0

Entering edit mode

Michael Dondrup ▴ 550

@michael-dondrup-3849

Last seen 11.3 years ago

Hi, I was trying to use write.XStringView on a larger dataset but to no avail. It seems like it is not implemented efficiently. What I am trying is: I downloaded http://hgdownload.cse.ucsc.edu/goldenPath/hg18/chromosomes/chr1.fa.gz > library(Biostrings) > dnasts <- read.DNAStringSet(file="chr1.fa") # break up the fasta file into segments of size 60 > dnaviews <- Views(dnasts[[1]], start = seq(1, length(dnasts[[1]]), 60), width=60) > write.XStringViews(dnaviews, file="out.fa") ... I interrupted the process after 1h reaching a memory peak of over 3GB. In principle doing the whole task should not take longer than a few seconds. I found this report: https://stat.ethz.ch/pipermail/bioc-sig- sequencing/2010-April/001160.html I guess that is the same problem? Has there been any progress? Is there probably a more efficient way of implementing this, e.g. using cat()? Thanks a lot Michael > sessionInfo() R version 2.11.1 (2010-05-31) x86_64-unknown-linux-gnu locale: [1] C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] Biostrings_2.16.9 IRanges_1.6.8 loaded via a namespace (and not attached): [1] Biobase_2.8.0 >

PROcess PROcess • 865 views

ADD COMMENT • link updated 15.4 years ago by Martin Morgan 25k • written 15.4 years ago by Michael Dondrup ▴ 550

0

Entering edit mode

Martin Morgan 25k

@martin-morgan-1513

Last seen 10 months ago

United States

On 07/27/2010 04:56 AM, Michael Dondrup wrote: > Hi, > > I was trying to use write.XStringView on a larger dataset but to no avail. It seems like it is not implemented > efficiently. What I am trying is: > > I downloaded http://hgdownload.cse.ucsc.edu/goldenPath/hg18/chromosomes/chr1.fa.gz > >> library(Biostrings) >> dnasts <- read.DNAStringSet(file="chr1.fa") Hi Michael -- This is also library(BSgenome.Hsapiens.UCSC.hg18) Hsapiens chr1 = unmasked(Hsapiens[["chr1"]]) > # break up the fasta file into segments of size 60 >> dnaviews <- Views(dnasts[[1]], start = seq(1, length(dnasts[[1]]), 60), width=60) ... and dnaviews <- Views(chr1, successiveIRanges(rep(60, ceiling(length(chr1) / 60)))) >> write.XStringViews(dnaviews, file="out.fa") > system.time(write.XStringSet(as(dnaviews, "DNAStringSet"), file=tempfile())) user system elapsed 7.024 0.756 8.030 but this is with > sessionInfo() R version 2.12.0 Under development (unstable) (2010-07-20 r52579) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] BSgenome.Hsapiens.UCSC.hg18_1.3.16 BSgenome_1.17.6 [3] Biostrings_2.17.24 GenomicRanges_1.1.17 [5] IRanges_1.7.12 loaded via a namespace (and not attached): [1] Biobase_2.9.0 tools_2.12.0 > ... I interrupted the process after 1h reaching a memory peak of over 3GB. > In principle doing the whole task should not take longer than a few seconds. I found this report: > https://stat.ethz.ch/pipermail/bioc-sig- sequencing/2010-April/001160.html > I guess that is the same problem? Has there been any progress? so yes, there is progress but it requires use of the 'devel' version of R and Bioconductor. There were a couple of other posts in that thread fasta = character(2 * length(dna)) fasta[c(TRUE, FALSE)] = paste(">", names(dna), sep="") fasta[c(FALSE, TRUE)] = as.character(dna) writeLines(fasta, fl) and the more complete patch that seemed not to make it to the mailing list directly but that is in http://www.mail-archive.com/bioc-sig-sequencing at r-project.org/msg01135.html I wonder what you're going to do with your fasta file now? Hope that helps, Martin > > Is there probably a more efficient way of implementing this, e.g. using cat()? > > Thanks a lot > Michael > >> sessionInfo() > R version 2.11.1 (2010-05-31) > x86_64-unknown-linux-gnu > > locale: > [1] C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] Biostrings_2.16.9 IRanges_1.6.8 > > loaded via a namespace (and not attached): > [1] Biobase_2.8.0 >> > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Martin Morgan Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793

ADD COMMENT • link 15.4 years ago Martin Morgan 25k

Login before adding your answer.