Question

Manipulating large DNAStringSet(s)

0

Entering edit mode

Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Last seen 13 months ago

United States

Hi all, I have a rather large DNAStringSet you could get from, say, subselecting many intervals on Hsapiens$chr1 and I want to go through the paces of reverseComplement-ing some of the ranges that correspond to reads on the anti-sense strand. I'm finding that doing this the "naive" way is taking a lot of time and memory, but adding some extra code to change my target to a character vector (from a DNAStringSet) with some careful subsetting and reverseComplement-ing is much faster. So I'm wondering if I'm either doing something wrong, or perhaps something in DNAStringSet can be optimized to deal with this? Like so: R> library(BSgenome.Hsapiens.UCSC.hg18) R> ranges <- IRanges(seq(1, seqlengths(Hsapiens)['chr1'] - 100, length.out=2141404), width=15) R> strands <- sample(c('+', '-'), length(ranges), replace=TRUE) R> is.pos <- strands == '+' R> v <- Views(Hsapiens$chr1, ranges) ## Pay no mind to the NNN's in the tail here I now want to reverseComplement the strings in v where strands == '-' R> strings <- DNAStringSet(v) R> strings[!is.pos] <- reverseComplement(strings[!is.pos]) That above command takes a very long time to complete ... it's actually been over a few minutes on my machine, so I'm killing it. Now let's do it another way, using normal character vectors and stuff R> strings <- DNAStringSet(v) R> result <- character(length(strings)) R> result[is.pos] <- as.character(strings[is.pos]) R> result[!is.pos] <- as.character(reverseComplement(strings[!is.pos])) ## and optionally ## R> result <- DNAStringSet(result) Just curious if there's a better way? R> sessionInfo() R version 2.11.0 (2010-04-22) x86_64-apple-darwin9.8.0 locale: [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] BSgenome.Hsapiens.UCSC.hg18_1.3.16 BSgenome_1.16.0 Biostrings_2.16.0 GenomicRanges_1.0.1 [5] IRanges_1.6.0 loaded via a namespace (and not attached): [1] Biobase_2.8.0 -- Steve Lianoglou Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact

GO Cancer BSgenome BSgenome GO Cancer BSgenome BSgenome • 984 views

ADD COMMENT • link updated 14.0 years ago by Hervé Pagès 16k • written 14.0 years ago by Steve Lianoglou ★ 13k

score 0 · Answer 1 · 2010-05-05

Hi Steve, Thanks for reporting this! I fixed this problem in IRanges 1.6.1 (release) and 1.7.1 (devel). Both will be available thru biocLite() in the next 24 hours or so. library(BSgenome.Hsapiens.UCSC.hg18) ranges <- IRanges(seq(1, seqlengths(Hsapiens)['chr1'] - 100, length.out=2141404), width=15) strands <- sample(c('+', '-'), length(ranges), replace=TRUE) is.pos <- strands == '+' v <- Views(Hsapiens$chr1, ranges) strings <- DNAStringSet(v) ## Takes about 2.5 sec on my laptop: strings[!is.pos] <- reverseComplement(strings[!is.pos]) Cheers, H. Steve Lianoglou wrote: > Hi all, > > I have a rather large DNAStringSet you could get from, say, > subselecting many intervals on Hsapiens$chr1 and I want to go through > the paces of reverseComplement-ing some of the ranges that correspond > to reads on the anti-sense strand. > > I'm finding that doing this the "naive" way is taking a lot of time > and memory, but adding some extra code to change my target to a > character vector (from a DNAStringSet) with some careful subsetting > and reverseComplement-ing is much faster. > > So I'm wondering if I'm either doing something wrong, or perhaps > something in DNAStringSet can be optimized to deal with this? > > Like so: > > R> library(BSgenome.Hsapiens.UCSC.hg18) > R> ranges <- IRanges(seq(1, seqlengths(Hsapiens)['chr1'] - 100, > length.out=2141404), width=15) > R> strands <- sample(c('+', '-'), length(ranges), replace=TRUE) > R> is.pos <- strands == '+' > R> v <- Views(Hsapiens$chr1, ranges) ## Pay no mind to the NNN's in > the tail here > > I now want to reverseComplement the strings in v where strands == '-' > > R> strings <- DNAStringSet(v) > R> strings[!is.pos] <- reverseComplement(strings[!is.pos]) > > That above command takes a very long time to complete ... it's > actually been over a few minutes on my machine, so I'm killing it. > > Now let's do it another way, using normal character vectors and stuff > > R> strings <- DNAStringSet(v) > R> result <- character(length(strings)) > R> result[is.pos] <- as.character(strings[is.pos]) > R> result[!is.pos] <- as.character(reverseComplement(strings[!is.pos])) > ## and optionally ## > R> result <- DNAStringSet(result) > > Just curious if there's a better way? > > R> sessionInfo() > R version 2.11.0 (2010-04-22) > x86_64-apple-darwin9.8.0 > > locale: > [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] BSgenome.Hsapiens.UCSC.hg18_1.3.16 BSgenome_1.16.0 > Biostrings_2.16.0 GenomicRanges_1.0.1 > [5] IRanges_1.6.0 > > loaded via a namespace (and not attached): > [1] Biobase_2.8.0 > > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319