Entering edit mode
Michael Dondrup
▴
550
@michael-dondrup-3849
Last seen 10.2 years ago
Thank you Martin,
the fact that write.XStringSet has been improved so much is really an
important feature. So I am very glad to hear this.
I got the idea from here:
http://biostar.stackexchange.com/questions/1853/code-golf-digesting-
fasta-sequences-into-a-set-of-smaller-sequences
However, I am using this function to write out real-world data and
was startled by the fact that it could break
would the data become larger, hence my question.
I also tested some methods in pure-R mentioned by you, maybe this is
useful to someone until the next version of Bioconductor is released:
1.
> http://www.mail-archive.com/bioc-sig-sequencing at
r-project.org/msg01135.html
This solution was in fact too slow for my purpose, possibly because it
uses for-loops and type conversions.
2. this solution was the fastest one in my case:
writeFASTA <- function(dna, desc=names(dna), file=stdout()) {
if (is.null(desc))
desc <- paste(seq(along=dna))
fasta = character(2 * length(dna))
fasta[c(TRUE, FALSE)] <- paste(">", desc, sep="")
fasta[c(FALSE, TRUE)] <- as.character(dna)
writeLines(fasta, file)
}
The downside of this function is it does not wrap long sequence lines.
So I came up with another one:
3.
writeFASTA <- function(sequences, desc=names(sequences), width=80,
file=stdout(), append=FALSE) {
sequences <- as.character(sequences)
if(!is.null(desc) && length(sequences) != length(desc))
stop("wrong length of 'desc'")
desc <- if (is.null(desc))
paste(">", seq(along=sequences))
else
paste(">", desc, "\n", sep="")
## Adjust line width or add a \n
sequences <- if (! is.null(width))
gsub( sprintf("(.{1,%d})", width), "\\1\n", sequences )
else
paste(sequences, "\n", sep="")
cat(paste(desc,sequences, sep="", collapse=""), sep="", file=file,
append=append )
}
Not totally fool-proof, but still much faster than solution 1.
Best
Michael
Am Jul 27, 2010 um 3:26 PM schrieb Martin Morgan:
> On 07/27/2010 04:56 AM, Michael Dondrup wrote:
>> Hi,
>>
>> I was trying to use write.XStringView on a larger dataset but to no
avail. It seems like it is not implemented
>> efficiently. What I am trying is:
>>
>> I downloaded
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/chromosomes/chr1.fa.gz
>>
>>> library(Biostrings)
>>> dnasts <- read.DNAStringSet(file="chr1.fa")
>
> Hi Michael --
>
> This is also
>
> library(BSgenome.Hsapiens.UCSC.hg18)
> Hsapiens
> chr1 = unmasked(Hsapiens[["chr1"]])
>
>> # break up the fasta file into segments of size 60
>>> dnaviews <- Views(dnasts[[1]], start = seq(1, length(dnasts[[1]]),
60), width=60)
>
> ... and
>
> dnaviews <-
> Views(chr1, successiveIRanges(rep(60, ceiling(length(chr1) /
60))))
>
>>> write.XStringViews(dnaviews, file="out.fa")
>
>> system.time(write.XStringSet(as(dnaviews, "DNAStringSet"),
> file=tempfile()))
> user system elapsed
> 7.024 0.756 8.030
>
>
> but this is with
>
>> sessionInfo()
> R version 2.12.0 Under development (unstable) (2010-07-20 r52579)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] BSgenome.Hsapiens.UCSC.hg18_1.3.16 BSgenome_1.17.6
> [3] Biostrings_2.17.24 GenomicRanges_1.1.17
> [5] IRanges_1.7.12
>
> loaded via a namespace (and not attached):
> [1] Biobase_2.9.0 tools_2.12.0
>
>
>> ... I interrupted the process after 1h reaching a memory peak of
over 3GB.
>> In principle doing the whole task should not take longer than a few
seconds. I found this report:
>> https://stat.ethz.ch/pipermail/bioc-sig-
sequencing/2010-April/001160.html
>> I guess that is the same problem? Has there been any progress?
>
> so yes, there is progress but it requires use of the 'devel' version
of
> R and Bioconductor.
>
> There were a couple of other posts in that thread
>
> fasta = character(2 * length(dna))
> fasta[c(TRUE, FALSE)] = paste(">", names(dna), sep="")
> fasta[c(FALSE, TRUE)] = as.character(dna)
> writeLines(fasta, fl)
>
> and the more complete patch that seemed not to make it to the
mailing
> list directly but that is in
> http://www.mail-archive.com/bioc-sig-sequencing at
r-project.org/msg01135.html
>
> I wonder what you're going to do with your fasta file now?
>
> Hope that helps,
>
> Martin
>
>>
>> Is there probably a more efficient way of implementing this, e.g.
using cat()?
>>
>> Thanks a lot
>> Michael
>>
>>> sessionInfo()
>> R version 2.11.1 (2010-05-31)
>> x86_64-unknown-linux-gnu
>>
>> locale:
>> [1] C
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods
base
>>
>> other attached packages:
>> [1] Biostrings_2.16.9 IRanges_1.6.8
>>
>> loaded via a namespace (and not attached):
>> [1] Biobase_2.8.0
>>>
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
> --
> Martin Morgan
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793