Question

Subsetting a DNAStringSet object with many DNAStrings

dr ▴ 10

@dr-9473

Last seen 21 months ago

United States

Hi,

I have a Biostrings DNAStringSet object with many DNAStrings in it, and I want to trim each of them so that it runs from position 1 to the smaller of its length and a fixed cutoff.

So far I'm using a for loop for this, as in this example:

library(dplyr)
set.seed(1)
seq.set <- lapply(1:100, function(s) paste(sample(c("A","C","G","T"), as.integer(abs(rnorm(1,500,1000))), replace = TRUE), collapse="")) %>%
  unlist() %>%
  Biostrings::DNAStringSet(.)

for(s in seq_along(seq.set))
  seq.set[s] <- Biostrings::subseq(seq.set[s], 1, min(650, Biostrings::width(seq.set[s])))

In reality, though, my DNAStringSet holds ~200,000 DNAStrings, so the loop takes quite a while. Is there a faster solution?

Biostrings • 1.9k views

updated 2.6 years ago by Hervé Pagès 16k • written 2.6 years ago by dr ▴ 10

score 0 · Answer 1 · 2022-05-12

It seems the way to go is to use lapply to find the end point of each DNAString in the DNAStringSet, and then pass those end points as the end argument to a single Biostrings::subseq call:

library(dplyr)
set.seed(1)
seq.set <- lapply(1:100, function(s) paste(sample(c("A","C","G","T"), as.integer(abs(rnorm(1,500,1000))), replace = TRUE), collapse="")) %>%
  unlist() %>%
  Biostrings::DNAStringSet(.)

seq.set.ends <- lapply(seq_along(seq.set), function(i) min(650, Biostrings::width(seq.set[i]))) %>% unlist()
seq.set <- Biostrings::subseq(seq.set,start = rep(1,length(seq.set)),end = seq.set.ends)
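The per-element lapply can probably be avoided too, since width() already returns one length per sequence and base R's pmin() is vectorized. A minimal sketch of that variant, assuming Biostrings is attached; the names cutoff, ends and trimmed are illustrative and not from the original post:

```r
library(Biostrings)

set.seed(1)
# Toy data: 100 random DNA sequences of varying length (illustrative only)
lens <- as.integer(abs(rnorm(100, 500, 1000)))
seqs <- vapply(
  lens,
  function(n) paste(sample(c("A", "C", "G", "T"), n, replace = TRUE), collapse = ""),
  character(1)
)
seq.set <- DNAStringSet(seqs)

cutoff <- 650  # assumed fixed cutoff, as in the question

# width() gives all sequence lengths at once; pmin() caps each at the cutoff
ends <- pmin(cutoff, width(seq.set))

# A single subseq() call trims every sequence; start = 1 is recycled
trimmed <- subseq(seq.set, start = 1, end = ends)
```

This keeps the same subseq call as above and only replaces the element-by-element computation of the end points.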

score 0 · Answer 2 · 2022-05-12

Hervé Pagès 16k

@herve-pages-1542

Last seen 6 days ago

Seattle, WA, United States

Try heads(x, cutoff)

H.
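A small sketch of how this suggestion might be applied, assuming Biostrings is attached (which should make heads() available through its dependencies) and that heads() keeps whole any sequence shorter than the cutoff. The toy sequences and the names x, cutoff and trimmed are made up for illustration:

```r
library(Biostrings)

# Toy DNAStringSet with sequences longer and shorter than the cutoff
x <- DNAStringSet(c("ACGTACGTAC", "ACG", "ACGTA"))
cutoff <- 5  # illustrative value; the question uses 650

# Keep at most the first `cutoff` letters of every sequence in one call
trimmed <- heads(x, cutoff)

# Inspect the resulting lengths
width(trimmed)
```

On the real data the call would presumably be heads(seq.set, 650), with no loop and no precomputed end points.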

2.6 years ago Hervé Pagès 16k