Manipulating large DNAStringSet(s)
1
0
Entering edit mode
@steve-lianoglou-2771
Last seen 13 months ago
United States
Hi all, I have a rather large DNAStringSet you could get from, say, subselecting many intervals on Hsapiens$chr1 and I want to go through the paces of reverseComplement-ing some of the ranges that correspond to reads on the anti-sense strand. I'm finding that doing this the "naive" way is taking a lot of time and memory, but adding some extra code to change my target to a character vector (from a DNAStringSet) with some careful subsetting and reverseComplement-ing is much faster. So I'm wondering if I'm either doing something wrong, or perhaps something in DNAStringSet can be optimized to deal with this? Like so: R> library(BSgenome.Hsapiens.UCSC.hg18) R> ranges <- IRanges(seq(1, seqlengths(Hsapiens)['chr1'] - 100, length.out=2141404), width=15) R> strands <- sample(c('+', '-'), length(ranges), replace=TRUE) R> is.pos <- strands == '+' R> v <- Views(Hsapiens$chr1, ranges) ## Pay no mind to the NNN's in the tail here I now want to reverseComplement the strings in v where strands == '-' R> strings <- DNAStringSet(v) R> strings[!is.pos] <- reverseComplement(strings[!is.pos]) That above command takes a very long time to complete ... it's actually been over a few minutes on my machine, so I'm killing it. Now let's do it another way, using normal character vectors and stuff R> strings <- DNAStringSet(v) R> result <- character(length(strings)) R> result[is.pos] <- as.character(strings[is.pos]) R> result[!is.pos] <- as.character(reverseComplement(strings[!is.pos])) ## and optionally ## R> result <- DNAStringSet(result) Just curious if there's a better way? R> sessionInfo() R version 2.11.0 (2010-04-22) x86_64-apple-darwin9.8.0 locale: [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] BSgenome.Hsapiens.UCSC.hg18_1.3.16 BSgenome_1.16.0 Biostrings_2.16.0 GenomicRanges_1.0.1 [5] IRanges_1.6.0 loaded via a namespace (and not attached): [1] Biobase_2.8.0 -- Steve Lianoglou Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact
GO Cancer BSgenome BSgenome GO Cancer BSgenome BSgenome • 984 views
ADD COMMENT
0
Entering edit mode
@herve-pages-1542
Last seen 4 days ago
Seattle, WA, United States
Hi Steve, Thanks for reporting this! I fixed this problem in IRanges 1.6.1 (release) and 1.7.1 (devel). Both will be available thru biocLite() in the next 24 hours or so. library(BSgenome.Hsapiens.UCSC.hg18) ranges <- IRanges(seq(1, seqlengths(Hsapiens)['chr1'] - 100, length.out=2141404), width=15) strands <- sample(c('+', '-'), length(ranges), replace=TRUE) is.pos <- strands == '+' v <- Views(Hsapiens$chr1, ranges) strings <- DNAStringSet(v) ## Takes about 2.5 sec on my laptop: strings[!is.pos] <- reverseComplement(strings[!is.pos]) Cheers, H. Steve Lianoglou wrote: > Hi all, > > I have a rather large DNAStringSet you could get from, say, > subselecting many intervals on Hsapiens$chr1 and I want to go through > the paces of reverseComplement-ing some of the ranges that correspond > to reads on the anti-sense strand. > > I'm finding that doing this the "naive" way is taking a lot of time > and memory, but adding some extra code to change my target to a > character vector (from a DNAStringSet) with some careful subsetting > and reverseComplement-ing is much faster. > > So I'm wondering if I'm either doing something wrong, or perhaps > something in DNAStringSet can be optimized to deal with this? > > Like so: > > R> library(BSgenome.Hsapiens.UCSC.hg18) > R> ranges <- IRanges(seq(1, seqlengths(Hsapiens)['chr1'] - 100, > length.out=2141404), width=15) > R> strands <- sample(c('+', '-'), length(ranges), replace=TRUE) > R> is.pos <- strands == '+' > R> v <- Views(Hsapiens$chr1, ranges) ## Pay no mind to the NNN's in > the tail here > > I now want to reverseComplement the strings in v where strands == '-' > > R> strings <- DNAStringSet(v) > R> strings[!is.pos] <- reverseComplement(strings[!is.pos]) > > That above command takes a very long time to complete ... it's > actually been over a few minutes on my machine, so I'm killing it. > > Now let's do it another way, using normal character vectors and stuff > > R> strings <- DNAStringSet(v) > R> result <- character(length(strings)) > R> result[is.pos] <- as.character(strings[is.pos]) > R> result[!is.pos] <- as.character(reverseComplement(strings[!is.pos])) > ## and optionally ## > R> result <- DNAStringSet(result) > > Just curious if there's a better way? > > R> sessionInfo() > R version 2.11.0 (2010-04-22) > x86_64-apple-darwin9.8.0 > > locale: > [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] BSgenome.Hsapiens.UCSC.hg18_1.3.16 BSgenome_1.16.0 > Biostrings_2.16.0 GenomicRanges_1.0.1 > [5] IRanges_1.6.0 > > loaded via a namespace (and not attached): > [1] Biobase_2.8.0 > > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD COMMENT
0
Entering edit mode
Hi Herv?, Thanks for coming through once again! -steve 2010/5/5 Hervé Pagès <hpages at="" fhcrc.org="">: > Hi Steve, > > Thanks for reporting this! I fixed this problem in IRanges 1.6.1 > (release) and 1.7.1 (devel). Both will be available thru biocLite() > in the next 24 hours or so. > > library(BSgenome.Hsapiens.UCSC.hg18) > ranges <- IRanges(seq(1, seqlengths(Hsapiens)['chr1'] - 100, > length.out=2141404), width=15) > strands <- sample(c('+', '-'), length(ranges), replace=TRUE) > is.pos <- strands == '+' > v <- Views(Hsapiens$chr1, ranges) > > strings <- DNAStringSet(v) > > ## Takes about 2.5 sec on my laptop: > strings[!is.pos] <- reverseComplement(strings[!is.pos]) > > Cheers, > H. > > Steve Lianoglou wrote: >> >> Hi all, >> >> I have a rather large DNAStringSet you could get from, say, >> subselecting many intervals on Hsapiens$chr1 and I want to go through >> the paces of reverseComplement-ing some of the ranges that correspond >> to reads on the anti-sense strand. >> >> I'm finding that doing this the "naive" way is taking a lot of time >> and memory, but adding some extra code to change my target to a >> character vector (from a DNAStringSet) with some careful subsetting >> and reverseComplement-ing is much faster. >> >> So I'm wondering if I'm either doing something wrong, or perhaps >> something in DNAStringSet can be optimized to deal with this? >> >> Like so: >> >> R> library(BSgenome.Hsapiens.UCSC.hg18) >> R> ranges <- IRanges(seq(1, seqlengths(Hsapiens)['chr1'] - 100, >> length.out=2141404), width=15) >> R> strands <- sample(c('+', '-'), length(ranges), replace=TRUE) >> R> is.pos <- strands == '+' >> R> v <- Views(Hsapiens$chr1, ranges) ## Pay no mind to the NNN's in >> the tail here >> >> I now want to reverseComplement the strings in v where strands == '-' >> >> R> strings <- DNAStringSet(v) >> R> strings[!is.pos] <- reverseComplement(strings[!is.pos]) >> >> That above command takes a very long time to complete ... it's >> actually been over a few minutes on my machine, so I'm killing it. >> >> Now let's do it another way, using normal character vectors and stuff >> >> R> strings <- DNAStringSet(v) >> R> result <- character(length(strings)) >> R> result[is.pos] <- as.character(strings[is.pos]) >> R> result[!is.pos] <- as.character(reverseComplement(strings[!is.pos])) >> ## and optionally ## >> R> result <- DNAStringSet(result) >> >> Just curious if there's a better way? >> >> R> sessionInfo() >> R version 2.11.0 (2010-04-22) >> x86_64-apple-darwin9.8.0 >> >> locale: >> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 >> >> attached base packages: >> [1] stats ? ? graphics ?grDevices utils ? ? datasets ?methods ? base >> >> other attached packages: >> [1] BSgenome.Hsapiens.UCSC.hg18_1.3.16 BSgenome_1.16.0 >> ? Biostrings_2.16.0 ? ? ? ? ? ? ? ? ?GenomicRanges_1.0.1 >> [5] IRanges_1.6.0 >> >> loaded via a namespace (and not attached): >> [1] Biobase_2.8.0 >> >> > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M2-B876 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org > Phone: ?(206) 667-5791 > Fax: ? ?(206) 667-1319 > -- Steve Lianoglou Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact
ADD REPLY

Login before adding your answer.

Traffic: 827 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6