write.XStringView/write.XStringSet highly inefficient (solved)
1
0
Entering edit mode
@michael-dondrup-3849
Last seen 9.6 years ago
Thank you Martin, the fact that write.XStringSet has been improved so much is really an important feature. So I am very glad to hear this. I got the idea from here: http://biostar.stackexchange.com/questions/1853/code-golf-digesting- fasta-sequences-into-a-set-of-smaller-sequences However, I am using this function to write out real-world data and was startled by the fact that it could break would the data become larger, hence my question. I also tested some methods in pure-R mentioned by you, maybe this is useful to someone until the next version of Bioconductor is released: 1. > http://www.mail-archive.com/bioc-sig-sequencing at r-project.org/msg01135.html This solution was in fact too slow for my purpose, possibly because it uses for-loops and type conversions. 2. this solution was the fastest one in my case: writeFASTA <- function(dna, desc=names(dna), file=stdout()) { if (is.null(desc)) desc <- paste(seq(along=dna)) fasta = character(2 * length(dna)) fasta[c(TRUE, FALSE)] <- paste(">", desc, sep="") fasta[c(FALSE, TRUE)] <- as.character(dna) writeLines(fasta, file) } The downside of this function is it does not wrap long sequence lines. So I came up with another one: 3. writeFASTA <- function(sequences, desc=names(sequences), width=80, file=stdout(), append=FALSE) { sequences <- as.character(sequences) if(!is.null(desc) && length(sequences) != length(desc)) stop("wrong length of 'desc'") desc <- if (is.null(desc)) paste(">", seq(along=sequences)) else paste(">", desc, "\n", sep="") ## Adjust line width or add a \n sequences <- if (! is.null(width)) gsub( sprintf("(.{1,%d})", width), "\\1\n", sequences ) else paste(sequences, "\n", sep="") cat(paste(desc,sequences, sep="", collapse=""), sep="", file=file, append=append ) } Not totally fool-proof, but still much faster than solution 1. Best Michael Am Jul 27, 2010 um 3:26 PM schrieb Martin Morgan: > On 07/27/2010 04:56 AM, Michael Dondrup wrote: >> Hi, >> >> I was trying to use write.XStringView on a larger dataset but to no avail. It seems like it is not implemented >> efficiently. What I am trying is: >> >> I downloaded http://hgdownload.cse.ucsc.edu/goldenPath/hg18/chromosomes/chr1.fa.gz >> >>> library(Biostrings) >>> dnasts <- read.DNAStringSet(file="chr1.fa") > > Hi Michael -- > > This is also > > library(BSgenome.Hsapiens.UCSC.hg18) > Hsapiens > chr1 = unmasked(Hsapiens[["chr1"]]) > >> # break up the fasta file into segments of size 60 >>> dnaviews <- Views(dnasts[[1]], start = seq(1, length(dnasts[[1]]), 60), width=60) > > ... and > > dnaviews <- > Views(chr1, successiveIRanges(rep(60, ceiling(length(chr1) / 60)))) > >>> write.XStringViews(dnaviews, file="out.fa") > >> system.time(write.XStringSet(as(dnaviews, "DNAStringSet"), > file=tempfile())) > user system elapsed > 7.024 0.756 8.030 > > > but this is with > >> sessionInfo() > R version 2.12.0 Under development (unstable) (2010-07-20 r52579) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=en_US.UTF-8 LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] BSgenome.Hsapiens.UCSC.hg18_1.3.16 BSgenome_1.17.6 > [3] Biostrings_2.17.24 GenomicRanges_1.1.17 > [5] IRanges_1.7.12 > > loaded via a namespace (and not attached): > [1] Biobase_2.9.0 tools_2.12.0 > > >> ... I interrupted the process after 1h reaching a memory peak of over 3GB. >> In principle doing the whole task should not take longer than a few seconds. I found this report: >> https://stat.ethz.ch/pipermail/bioc-sig- sequencing/2010-April/001160.html >> I guess that is the same problem? Has there been any progress? > > so yes, there is progress but it requires use of the 'devel' version of > R and Bioconductor. > > There were a couple of other posts in that thread > > fasta = character(2 * length(dna)) > fasta[c(TRUE, FALSE)] = paste(">", names(dna), sep="") > fasta[c(FALSE, TRUE)] = as.character(dna) > writeLines(fasta, fl) > > and the more complete patch that seemed not to make it to the mailing > list directly but that is in > http://www.mail-archive.com/bioc-sig-sequencing at r-project.org/msg01135.html > > I wonder what you're going to do with your fasta file now? > > Hope that helps, > > Martin > >> >> Is there probably a more efficient way of implementing this, e.g. using cat()? >> >> Thanks a lot >> Michael >> >>> sessionInfo() >> R version 2.11.1 (2010-05-31) >> x86_64-unknown-linux-gnu >> >> locale: >> [1] C >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] Biostrings_2.16.9 IRanges_1.6.8 >> >> loaded via a namespace (and not attached): >> [1] Biobase_2.8.0 >>> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > > -- > Martin Morgan > Computational Biology / Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. > PO Box 19024 Seattle, WA 98109 > > Location: Arnold Building M1 B861 > Phone: (206) 667-2793
Cancer BSgenome PROcess GLAD BSgenome Cancer BSgenome PROcess GLAD BSgenome • 1.3k views
ADD COMMENT
0
Entering edit mode
@steve-lianoglou-2771
Last seen 14 months ago
United States
Hi, On Wed, Jul 28, 2010 at 5:10 AM, Michael Dondrup <michael.dondrup at="" uni.no=""> wrote: <snip> > The downside of this function is it does not wrap long sequence lines. So I came up with another one: > > 3. > writeFASTA <- function(sequences, desc=names(sequences), width=80, file=stdout(), append=FALSE) { > ?sequences <- as.character(sequences) > ?if(!is.null(desc) && length(sequences) != length(desc)) > ? ? stop("wrong length of 'desc'") > ?desc <- if (is.null(desc)) > ? ?paste(">", seq(along=sequences)) > ?else > ? ?paste(">", desc, "\n", sep="") > ?## Adjust line width or add a \n > ?sequences <- ?if (! is.null(width)) > ? ?gsub( sprintf("(.{1,%d})", width), "\\1\n", sequences ) > ?else > ? ?paste(sequences, "\n", sep="") > ?cat(paste(desc,sequences, sep="", collapse=""), sep="", file=file, append=append ) > } Nice use of regex's! Reminds me of: http://xkcd.com/208/ ;-) -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact
ADD COMMENT

Login before adding your answer.

Traffic: 707 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6