write.XStringView/write.XStringSet highly inefficient (solved)

0

Entering edit mode

Michael Dondrup ▴ 550

@michael-dondrup-3849

Last seen 9.6 years ago

Thank you Martin, the fact that write.XStringSet has been improved so much is really an important feature. So I am very glad to hear this. I got the idea from here: http://biostar.stackexchange.com/questions/1853/code-golf-digesting- fasta-sequences-into-a-set-of-smaller-sequences However, I am using this function to write out real-world data and was startled by the fact that it could break would the data become larger, hence my question. I also tested some methods in pure-R mentioned by you, maybe this is useful to someone until the next version of Bioconductor is released: 1. > http://www.mail-archive.com/bioc-sig-sequencing at r-project.org/msg01135.html This solution was in fact too slow for my purpose, possibly because it uses for-loops and type conversions. 2. this solution was the fastest one in my case: writeFASTA <- function(dna, desc=names(dna), file=stdout()) { if (is.null(desc)) desc <- paste(seq(along=dna)) fasta = character(2 * length(dna)) fasta[c(TRUE, FALSE)] <- paste(">", desc, sep="") fasta[c(FALSE, TRUE)] <- as.character(dna) writeLines(fasta, file) } The downside of this function is it does not wrap long sequence lines. So I came up with another one: 3. writeFASTA <- function(sequences, desc=names(sequences), width=80, file=stdout(), append=FALSE) { sequences <- as.character(sequences) if(!is.null(desc) && length(sequences) != length(desc)) stop("wrong length of 'desc'") desc <- if (is.null(desc)) paste(">", seq(along=sequences)) else paste(">", desc, "\n", sep="") ## Adjust line width or add a \n sequences <- if (! is.null(width)) gsub( sprintf("(.{1,%d})", width), "\\1\n", sequences ) else paste(sequences, "\n", sep="") cat(paste(desc,sequences, sep="", collapse=""), sep="", file=file, append=append ) } Not totally fool-proof, but still much faster than solution 1. Best Michael Am Jul 27, 2010 um 3:26 PM schrieb Martin Morgan: > On 07/27/2010 04:56 AM, Michael Dondrup wrote: >> Hi, >> >> I was trying to use write.XStringView on a larger dataset but to no avail. It seems like it is not implemented >> efficiently. What I am trying is: >> >> I downloaded http://hgdownload.cse.ucsc.edu/goldenPath/hg18/chromosomes/chr1.fa.gz >> >>> library(Biostrings) >>> dnasts <- read.DNAStringSet(file="chr1.fa") > > Hi Michael -- > > This is also > > library(BSgenome.Hsapiens.UCSC.hg18) > Hsapiens > chr1 = unmasked(Hsapiens[["chr1"]]) > >> # break up the fasta file into segments of size 60 >>> dnaviews <- Views(dnasts[[1]], start = seq(1, length(dnasts[[1]]), 60), width=60) > > ... and > > dnaviews <- > Views(chr1, successiveIRanges(rep(60, ceiling(length(chr1) / 60)))) > >>> write.XStringViews(dnaviews, file="out.fa") > >> system.time(write.XStringSet(as(dnaviews, "DNAStringSet"), > file=tempfile())) > user system elapsed > 7.024 0.756 8.030 > > > but this is with > >> sessionInfo() > R version 2.12.0 Under development (unstable) (2010-07-20 r52579) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=en_US.UTF-8 LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] BSgenome.Hsapiens.UCSC.hg18_1.3.16 BSgenome_1.17.6 > [3] Biostrings_2.17.24 GenomicRanges_1.1.17 > [5] IRanges_1.7.12 > > loaded via a namespace (and not attached): > [1] Biobase_2.9.0 tools_2.12.0 > > >> ... I interrupted the process after 1h reaching a memory peak of over 3GB. >> In principle doing the whole task should not take longer than a few seconds. I found this report: >> https://stat.ethz.ch/pipermail/bioc-sig- sequencing/2010-April/001160.html >> I guess that is the same problem? Has there been any progress? > > so yes, there is progress but it requires use of the 'devel' version of > R and Bioconductor. > > There were a couple of other posts in that thread > > fasta = character(2 * length(dna)) > fasta[c(TRUE, FALSE)] = paste(">", names(dna), sep="") > fasta[c(FALSE, TRUE)] = as.character(dna) > writeLines(fasta, fl) > > and the more complete patch that seemed not to make it to the mailing > list directly but that is in > http://www.mail-archive.com/bioc-sig-sequencing at r-project.org/msg01135.html > > I wonder what you're going to do with your fasta file now? > > Hope that helps, > > Martin > >> >> Is there probably a more efficient way of implementing this, e.g. using cat()? >> >> Thanks a lot >> Michael >> >>> sessionInfo() >> R version 2.11.1 (2010-05-31) >> x86_64-unknown-linux-gnu >> >> locale: >> [1] C >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] Biostrings_2.16.9 IRanges_1.6.8 >> >> loaded via a namespace (and not attached): >> [1] Biobase_2.8.0 >>> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > > -- > Martin Morgan > Computational Biology / Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. > PO Box 19024 Seattle, WA 98109 > > Location: Arnold Building M1 B861 > Phone: (206) 667-2793

Cancer BSgenome PROcess GLAD BSgenome Cancer BSgenome PROcess GLAD BSgenome • 1.3k views

ADD COMMENT • link updated 13.8 years ago by Steve Lianoglou ★ 13k • written 13.8 years ago by Michael Dondrup ▴ 550

0

Entering edit mode

Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Last seen 14 months ago

United States

Hi, On Wed, Jul 28, 2010 at 5:10 AM, Michael Dondrup <michael.dondrup at="" uni.no=""> wrote: <snip> > The downside of this function is it does not wrap long sequence lines. So I came up with another one: > > 3. > writeFASTA <- function(sequences, desc=names(sequences), width=80, file=stdout(), append=FALSE) { > ?sequences <- as.character(sequences) > ?if(!is.null(desc) && length(sequences) != length(desc)) > ? ? stop("wrong length of 'desc'") > ?desc <- if (is.null(desc)) > ? ?paste(">", seq(along=sequences)) > ?else > ? ?paste(">", desc, "\n", sep="") > ?## Adjust line width or add a \n > ?sequences <- ?if (! is.null(width)) > ? ?gsub( sprintf("(.{1,%d})", width), "\\1\n", sequences ) > ?else > ? ?paste(sequences, "\n", sep="") > ?cat(paste(desc,sequences, sep="", collapse=""), sep="", file=file, append=append ) > } Nice use of regex's! Reminds me of: http://xkcd.com/208/ ;-) -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact

ADD COMMENT • link 13.8 years ago Steve Lianoglou ★ 13k

Login before adding your answer.