Question

how to operate on a DNAStringSet object

0

Entering edit mode

Chris Seidel ▴ 80

@chris-seidel-5840

Last seen 2.9 years ago

United States

I can't figure out how to operate on a DNAStringSet object, and get a DNAStringSet object back. I have a DNAStringSet object for which I'd like to randomize the order of nucleotides in the sequences. Here is a simple example: # function to randomly re-order a string randomizeSeq <- function(mystring){ # split the string into characters myChars <- unlist(strsplit(as.character(mystring),'')) # randomize the order of characters random.myChars <- sample(myChars,length(myChars)) # concatenate them back into a string and return the result DNAStringSet(paste(random.myChars,collapse='')) } # create some sequences seq1 <- paste(DNA_BASES[sample(1:4,5,replace=T)], collapse="") seq2 <- paste(DNA_BASES[sample(1:4,5,replace=T)], collapse="") # make a DNAStringSet object myseqs <- DNAStringSet(c(seq1,seq2)) names(myseqs) <- c("s1", "s2") # apply the randomize function to the DNAStringSet object myRandomizedseqs <- sapply(myseqs,randomizeSeq) # The result is a list, try to make a DNAStringSetObject out of it DNAStringSet(unlist(myRandomizedseqs)) # that doesn't work, try something else DNAStringSet(do.call(c,unlist(myRandomizedseqs))) # that doesn't work. I gave up (after several hours) and wrote a for loop, which works but takes forever on my sequence set, and makes me feel stupid. What's odd, is that this actually works: DNAStringSet(do.call(c,unlist(myRandomizedseqs))) *IF* the sequences are NOT NAMED. How does one operate on the sequences of a DNAStringSet object without getting a list back, or without a for loop? I'm sure there's some elegant one-liner that completely escapes me. > sessionInfo() R version 2.15.2 (2012-10-26) Platform: x86_64-w64-mingw32/x64 (64-bit) locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C [5] LC_TIME=English_United States.1252 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] BSgenome.Dmelanogaster.UCSC.dm3_1.3.19 BSgenome_1.26.1 Biostrings_2.26.2 GenomicRanges_1.10.5 [5] IRanges_1.16.4 BiocGenerics_0.4.0 loaded via a namespace (and not attached): [1] parallel_2.15.2 stats4_2.15.2 tools_2.15.2

BSgenome BSgenome BSgenome BSgenome • 8.9k views

ADD COMMENT • link 11.1 years ago Chris Seidel ▴ 80

score 0 · Answer 1 · 2013-03-21

0

Entering edit mode

Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Last seen 13 months ago

United States

Hi, On Thu, Mar 21, 2013 at 4:48 PM, Chris Seidel <seidel at="" phaget4.org=""> wrote: [aggressive clipping] > What's odd, is that this actually works: > > DNAStringSet(do.call(c,unlist(myRandomizedseqs))) > > *IF* the sequences are NOT NAMED. This (or similar things) have come up before on the ML, but I don't have time to search for it right now. I posted a suggestion that I use "unname" defensively to sidestep these corner cases. Perhaps that will help you find the thread when searching the archives. In any event, you could do: R> DNAStringSet(do.call(c, unname(unlist(...)))) Now that I look at your example, I think the thread I'm talking about might have been slightly different, but I guess this should still work in your case. > How does one operate on the sequences of a DNAStringSet object without > getting a list back, or without a for loop? I'm sure there's some > elegant one-liner that completely escapes me. To randomize the sequences, you could do: R> xx <- DNAStringSet(c("GATACA", "GATCCTAA")) R> endoapply(xx, sample) A DNAStringSet instance of length 2 width seq [1] 6 ACGATA [2] 8 GTCATAAC Where did that come from, right? Note that a DNAStringSet is an IRanges::Vector, and you'll find lots of things in the IRangesOverview vignette, which at first might seem like to long/detailed to read, but will be worth your time. Not sure how fast this will be on large XStringSet object, though. You may not buy yourself more speed than the for loop, but can't test that right now. Perhaps lapply(DNAstringSet, sample) might be faster, but I'll leave this as an exercise for the reader. HTH, -steve -- Steve Lianoglou Defender of The Thesis | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact

ADD COMMENT • link 11.1 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

And upon one more second's worth of inspection, the endoapply suggestion actually is a for-loop under the covers, so you won't be buying time ... I guess the lapply will go faster ... -steve On Thu, Mar 21, 2013 at 5:05 PM, Steve Lianoglou <mailinglist.honeypot at="" gmail.com=""> wrote: > Hi, > > On Thu, Mar 21, 2013 at 4:48 PM, Chris Seidel <seidel at="" phaget4.org=""> wrote: > [aggressive clipping] > >> What's odd, is that this actually works: >> >> DNAStringSet(do.call(c,unlist(myRandomizedseqs))) >> >> *IF* the sequences are NOT NAMED. > > This (or similar things) have come up before on the ML, but I don't > have time to search for it right now. I posted a suggestion that I use > "unname" defensively to sidestep these corner cases. Perhaps that will > help you find the thread when searching the archives. In any event, > you could do: > > R> DNAStringSet(do.call(c, unname(unlist(...)))) > > Now that I look at your example, I think the thread I'm talking about > might have been slightly different, but I guess this should still work > in your case. > >> How does one operate on the sequences of a DNAStringSet object without >> getting a list back, or without a for loop? I'm sure there's some >> elegant one-liner that completely escapes me. > > To randomize the sequences, you could do: > > R> xx <- DNAStringSet(c("GATACA", "GATCCTAA")) > R> endoapply(xx, sample) > A DNAStringSet instance of length 2 > width seq > [1] 6 ACGATA > [2] 8 GTCATAAC > > Where did that come from, right? > > Note that a DNAStringSet is an IRanges::Vector, and you'll find lots > of things in the IRangesOverview vignette, which at first might seem > like to long/detailed to read, but will be worth your time. > > Not sure how fast this will be on large XStringSet object, though. You > may not buy yourself more speed than the for loop, but can't test that > right now. Perhaps lapply(DNAstringSet, sample) might be faster, but > I'll leave this as an exercise for the reader. > > HTH, > -steve > > -- > Steve Lianoglou > Defender of The Thesis > | Memorial Sloan-Kettering Cancer Center > | Weill Medical College of Cornell University > Contact Info: http://cbio.mskcc.org/~lianos/contact -- Steve Lianoglou Defender of The Thesis | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact

ADD REPLY • link 11.1 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Hi Chris, Steve, I tried Steve's 'endoapply(xx, sample)' and was surprised/disappointed that it's about 10x slower than doing lapply() :-/ The reason for that is that it was actually calling the "endoapply" method for List objects, which is very inefficient. I just added an "endoapply" method for XVectorList objects (DNAStringSet objects are just particular XVectorList objects) that is as fast as doing lapply(): > library(Biostrings) > library(hgu95av2probe) > probes <- DNAStringSet(hgu95av2probe) > system.time(probes2 <- endoapply(probes, sample)) user system elapsed 389.928 0.188 390.935 Still not particularly fast but 10x faster than before. This is in Biostrings 2.27.12 (devel). A faster (but more complicated) way to go is to use a 3-step approach: (1) unlist 'probes' (produces a single big DNAString object), (2) operate on the unlisted object, and (3) relist it: (1) Unlist: unlisted <- unlist(probes) (2) Shuffle the nucleotides in 'unlisted'. One complication is that the shuffling must not move nucleotides across the boundaries of the probes. This requires a little bit of extra work: probes_width <- width(probes) shuffling_idx <- IntegerList(lapply(probes_width, sample)) offsets <- cumsum(c(0L, probes_width[-length(probes_width)])) shuffling_idx <- unlist(shuffling_idx + offsets) ## Sanity check. stopifnot(identical(sort(shuffling_idx), seq_along(unlisted))) unlisted2 <- unlisted[shuffling_idx] (3) Relist. Unfortunately, there is no "relist' method for DNAString objects at the moment (I still need to add one). In the mean time we can do: probes2b <- as(successiveViews(unlisted2, width(probes)), "DNAStringSet") With this method, it takes only about 10 sec on my laptop to shuffle the nucleotides of all the probes in hgu95av2probe. If you are careful to set the RNG seed to the same value (via set.seed) before you run endoapply() and the 3-step approach, you'll end up with exactly the same result i.e. 'identical(as.character(probes2b), as.character(probes2))' will be TRUE. Cheers, H. On 03/21/2013 02:07 PM, Steve Lianoglou wrote: > And upon one more second's worth of inspection, the endoapply > suggestion actually is a for-loop under the covers, so you won't be > buying time ... I guess the lapply will go faster ... > > -steve > > On Thu, Mar 21, 2013 at 5:05 PM, Steve Lianoglou > <mailinglist.honeypot at="" gmail.com=""> wrote: >> Hi, >> >> On Thu, Mar 21, 2013 at 4:48 PM, Chris Seidel <seidel at="" phaget4.org=""> wrote: >> [aggressive clipping] >> >>> What's odd, is that this actually works: >>> >>> DNAStringSet(do.call(c,unlist(myRandomizedseqs))) >>> >>> *IF* the sequences are NOT NAMED. >> >> This (or similar things) have come up before on the ML, but I don't >> have time to search for it right now. I posted a suggestion that I use >> "unname" defensively to sidestep these corner cases. Perhaps that will >> help you find the thread when searching the archives. In any event, >> you could do: >> >> R> DNAStringSet(do.call(c, unname(unlist(...)))) >> >> Now that I look at your example, I think the thread I'm talking about >> might have been slightly different, but I guess this should still work >> in your case. >> >>> How does one operate on the sequences of a DNAStringSet object without >>> getting a list back, or without a for loop? I'm sure there's some >>> elegant one-liner that completely escapes me. >> >> To randomize the sequences, you could do: >> >> R> xx <- DNAStringSet(c("GATACA", "GATCCTAA")) >> R> endoapply(xx, sample) >> A DNAStringSet instance of length 2 >> width seq >> [1] 6 ACGATA >> [2] 8 GTCATAAC >> >> Where did that come from, right? >> >> Note that a DNAStringSet is an IRanges::Vector, and you'll find lots >> of things in the IRangesOverview vignette, which at first might seem >> like to long/detailed to read, but will be worth your time. >> >> Not sure how fast this will be on large XStringSet object, though. You >> may not buy yourself more speed than the for loop, but can't test that >> right now. Perhaps lapply(DNAstringSet, sample) might be faster, but >> I'll leave this as an exercise for the reader. >> >> HTH, >> -steve >> >> -- >> Steve Lianoglou >> Defender of The Thesis >> | Memorial Sloan-Kettering Cancer Center >> | Weill Medical College of Cornell University >> Contact Info: http://cbio.mskcc.org/~lianos/contact > > > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 11.1 years ago Hervé Pagès 16k

0

Entering edit mode

Very slick, Hervé ... very slick indeed. On Friday, March 22, 2013, Hervé Pagès wrote: > Hi Chris, Steve, > > I tried Steve's 'endoapply(xx, sample)' and was surprised/disappointed > that it's about 10x slower than doing lapply() :-/ > The reason for that is that it was actually calling the "endoapply" > method for List objects, which is very inefficient. > > I just added an "endoapply" method for XVectorList objects (DNAStringSet > objects are just particular XVectorList objects) that is as fast as > doing lapply(): > > > library(Biostrings) > > library(hgu95av2probe) > > probes <- DNAStringSet(hgu95av2probe) > > system.time(probes2 <- endoapply(probes, sample)) > user system elapsed > 389.928 0.188 390.935 > > Still not particularly fast but 10x faster than before. > > This is in Biostrings 2.27.12 (devel). > > A faster (but more complicated) way to go is to use a 3-step approach: > (1) unlist 'probes' (produces a single big DNAString object), (2) > operate on the unlisted object, and (3) relist it: > > (1) Unlist: > > unlisted <- unlist(probes) > > (2) Shuffle the nucleotides in 'unlisted'. One complication is that > the shuffling must not move nucleotides across the boundaries > of the probes. This requires a little bit of extra work: > > probes_width <- width(probes) > shuffling_idx <- IntegerList(lapply(probes_**width, sample)) > offsets <- cumsum(c(0L, probes_width[-length(probes_**width)])) > shuffling_idx <- unlist(shuffling_idx + offsets) > > ## Sanity check. > stopifnot(identical(sort(**shuffling_idx), seq_along(unlisted))) > > unlisted2 <- unlisted[shuffling_idx] > > (3) Relist. Unfortunately, there is no "relist' method for DNAString > objects at the moment (I still need to add one). In the mean > time we can do: > > probes2b <- as(successiveViews(unlisted2, width(probes)), > "DNAStringSet") > > With this method, it takes only about 10 sec on my laptop to shuffle > the nucleotides of all the probes in hgu95av2probe. If you are careful > to set the RNG seed to the same value (via set.seed) before you run > endoapply() and the 3-step approach, you'll end up with exactly the same > result i.e. 'identical(as.character(**probes2b), as.character(probes2))' > will be TRUE. > > Cheers, > H. > > > On 03/21/2013 02:07 PM, Steve Lianoglou wrote: > >> And upon one more second's worth of inspection, the endoapply >> suggestion actually is a for-loop under the covers, so you won't be >> buying time ... I guess the lapply will go faster ... >> >> -steve >> >> On Thu, Mar 21, 2013 at 5:05 PM, Steve Lianoglou >> <mailinglist.honeypot@gmail.com> wrote: >> >>> Hi, >>> >>> On Thu, Mar 21, 2013 at 4:48 PM, Chris Seidel <seidel@phaget4.org> >>> wrote: >>> [aggressive clipping] >>> >>> What's odd, is that this actually works: >>>> >>>> DNAStringSet(do.call(c,unlist(**myRandomizedseqs))) >>>> >>>> *IF* the sequences are NOT NAMED. >>>> >>> >>> This (or similar things) have come up before on the ML, but I don't >>> have time to search for it right now. I posted a suggestion that I use >>> "unname" defensively to sidestep these corner cases. Perhaps that will >>> help you find the thread when searching the archives. In any event, >>> you could do: >>> >>> R> DNAStringSet(do.call(c, unname(unlist(...)))) >>> >>> Now that I look at your example, I think the thread I'm talking about >>> might have been slightly different, but I guess this should still work >>> in your case. >>> >>> How does one operate on the sequences of a DNAStringSet object without >>>> getting a list back, or without a for loop? I'm sure there's some >>>> elegant one-liner that completely escapes me. >>>> >>> >>> To randomize the sequences, you could do: >>> >>> R> xx <- DNAStringSet(c("GATACA", "GATCCTAA")) >>> R> endoapply(xx, sample) >>> A DNAStringSet instance of length 2 >>> width seq >>> [1] 6 ACGATA >>> [2] 8 GTCATAAC >>> >>> Where did that come from, right? >>> >>> Note that a DNAStringSet is an IRanges::Vector, and you'll find lots >>> of things in the IRangesOverview vignette, which at first might seem >>> like to long/detailed to read, but will be worth your time. >>> >>> Not sure how fast this will be on large XStringSet object, though. You >>> may not buy yourself more speed than the for loop, but can't test that >>> right now. Perhaps lapply(DNAstringSet, sample) might be faster, but >>> I'll leave this as an exercise for the reader. >>> >>> HTH, >>> -steve >>> >>> -- >>> Steve Lianoglou >>> Defender of The Thesis >>> | Memorial Sloan-Kettering Cancer Center >>> | Weill Medical College of Cornell University >>> Contact Info: http://cbio.mskcc.org/~lianos/**contact<http: cbio.="" mskcc.org="" ~lianos="" contact=""> >>> >> >> >> >> > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > -- Steve Lianoglou Defender of The Thesis | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact [[alternative HTML version deleted]]

ADD REPLY • link 11.1 years ago Steve Lianoglou ★ 13k

score 0 · Answer 2 · 2013-03-22

Hello Steve and Herve, Thanks very much! Holy cow! Randomize a string set with endoapply(xx, sample). This is the essence of my love/hate relationship with programming. I've spent my share of quality time with the IRanges vignette, and I still would have never thought of that 21 character one- liner! I did run across your previous post Steve about unname, it was what prompted my experimentation. I'm also glad to know that my intition of how to sew things back together might involve some kind of Views function (as Herve suggests). Anyway, thanks for the suggestions and solutions. Enormously helpful. The application involves searching ChIP Seq peaks (as represented by a DNAStringSet) for cis-regulatory elements, and scrambing the sequences is one method of creating a background, and the methods below are easy, fast, and work great. -Chris On Fri, Mar 22, 2013 at 9:18 PM, Hervé Pagès <hpages at="" fhcrc.org=""> wrote: > Hi Chris, Steve, > > I tried Steve's 'endoapply(xx, sample)' and was surprised/disappointed > that it's about 10x slower than doing lapply() :-/ > The reason for that is that it was actually calling the "endoapply" > method for List objects, which is very inefficient. > > I just added an "endoapply" method for XVectorList objects (DNAStringSet > objects are just particular XVectorList objects) that is as fast as > doing lapply(): > > > library(Biostrings) > > library(hgu95av2probe) > > probes <- DNAStringSet(hgu95av2probe) > > system.time(probes2 <- endoapply(probes, sample)) > user system elapsed > 389.928 0.188 390.935 > > Still not particularly fast but 10x faster than before. > > This is in Biostrings 2.27.12 (devel). > > A faster (but more complicated) way to go is to use a 3-step approach: > (1) unlist 'probes' (produces a single big DNAString object), (2) > operate on the unlisted object, and (3) relist it: > > (1) Unlist: > > unlisted <- unlist(probes) > > (2) Shuffle the nucleotides in 'unlisted'. One complication is that > the shuffling must not move nucleotides across the boundaries > of the probes. This requires a little bit of extra work: > > probes_width <- width(probes) > shuffling_idx <- IntegerList(lapply(probes_width, sample)) > offsets <- cumsum(c(0L, probes_width[-length(probes_width)])) > shuffling_idx <- unlist(shuffling_idx + offsets) > > ## Sanity check. > stopifnot(identical(sort(shuffling_idx), seq_along(unlisted))) > > unlisted2 <- unlisted[shuffling_idx] > > (3) Relist. Unfortunately, there is no "relist' method for DNAString > objects at the moment (I still need to add one). In the mean > time we can do: > > probes2b <- as(successiveViews(unlisted2, width(probes)), > "DNAStringSet") > > With this method, it takes only about 10 sec on my laptop to shuffle > the nucleotides of all the probes in hgu95av2probe. If you are careful > to set the RNG seed to the same value (via set.seed) before you run > endoapply() and the 3-step approach, you'll end up with exactly the same > result i.e. 'identical(as.character(probes2b), as.character(probes2))' > will be TRUE. > > Cheers, > H. > > > On 03/21/2013 02:07 PM, Steve Lianoglou wrote: >> >> And upon one more second's worth of inspection, the endoapply >> suggestion actually is a for-loop under the covers, so you won't be >> buying time ... I guess the lapply will go faster ... >> >> -steve >> >> On Thu, Mar 21, 2013 at 5:05 PM, Steve Lianoglou >> <mailinglist.honeypot at="" gmail.com=""> wrote: >>> >>> Hi, >>> >>> On Thu, Mar 21, 2013 at 4:48 PM, Chris Seidel <seidel at="" phaget4.org=""> wrote: >>> [aggressive clipping] >>> >>>> What's odd, is that this actually works: >>>> >>>> DNAStringSet(do.call(c,unlist(myRandomizedseqs))) >>>> >>>> *IF* the sequences are NOT NAMED. >>> >>> >>> This (or similar things) have come up before on the ML, but I don't >>> have time to search for it right now. I posted a suggestion that I use >>> "unname" defensively to sidestep these corner cases. Perhaps that will >>> help you find the thread when searching the archives. In any event, >>> you could do: >>> >>> R> DNAStringSet(do.call(c, unname(unlist(...)))) >>> >>> Now that I look at your example, I think the thread I'm talking about >>> might have been slightly different, but I guess this should still work >>> in your case. >>> >>>> How does one operate on the sequences of a DNAStringSet object without >>>> getting a list back, or without a for loop? I'm sure there's some >>>> elegant one-liner that completely escapes me. >>> >>> >>> To randomize the sequences, you could do: >>> >>> R> xx <- DNAStringSet(c("GATACA", "GATCCTAA")) >>> R> endoapply(xx, sample) >>> A DNAStringSet instance of length 2 >>> width seq >>> [1] 6 ACGATA >>> [2] 8 GTCATAAC >>> >>> Where did that come from, right? >>> >>> Note that a DNAStringSet is an IRanges::Vector, and you'll find lots >>> of things in the IRangesOverview vignette, which at first might seem >>> like to long/detailed to read, but will be worth your time. >>> >>> Not sure how fast this will be on large XStringSet object, though. You >>> may not buy yourself more speed than the for loop, but can't test that >>> right now. Perhaps lapply(DNAstringSet, sample) might be faster, but >>> I'll leave this as an exercise for the reader. >>> >>> HTH, >>> -steve >>> >>> -- >>> Steve Lianoglou >>> Defender of The Thesis >> > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center