Question

What is the best way to subset a DNAStringSet by a GRange ?

0

Entering edit mode

reuscher.stefan • 0

@reuscherstefan-12824

Last seen 7.0 years ago

Hi All,

I recently had some task which I thought should be easy to accomplish. I wanted to subset a DNAStringSet (a whole genome) using the information in a GRanges object (output of tileGenome()). I searched quite a bit in the help and here but could not find a "bioconductor" way of doing it and ended up for-looping through the ranges.

There is an exampe below.

My question is: Am I missing something or is there really no connection between GRanges and BioStrings ?

Best Regards,

Stefan

Reproducible example:

testRanges <- GRanges(seqnames = c("seq1","seq1","seq2"),
                    ranges = IRanges(start = c(10,20,10),
                                     width = c(5,5,5))
)

testSeqs <- DNAStringSet(c("tagtagctagctacgatgcatgctacgatgcatgc",
                           "tatgctagctacgtacgtacgatcgtacgtagcta")
)
names(testSeqs) <- c("seq1","seq2")

#what I want
#newTestSeqs <- subseq(testSeqs, by = testRanges)

#what I do
testMask <- data.frame(seq = as.character(seqnames(testRanges)),
                       start = start(testRanges),
                       end = end(testRanges))

newTestSeqs <- DNAStringSet()
for (i in 1:nrow(testMask)){

  temp <- testSeqs[testMask[i,"seq"]]

  temp <- subseq(temp,
                 start = testMask[i,"start"],
                 end = testMask[i,"end"])

  newTestSeqs <- c(newTestSeqs,temp)
}

bioconductor biostrings granges • 4.2k views

ADD COMMENT • link updated 7.0 years ago by Michael Lawrence ★ 11k • written 7.0 years ago by reuscher.stefan • 0

score 2 · Accepted Answer · 2017-04-12

2

Entering edit mode

Michael Lawrence ★ 11k

@michael-lawrence-3846

Last seen 2.4 years ago

United States

subseq() acts like substring(): it is vectorized over its input (x), so the seqnames of the GRanges wouldn't apply. Instead, just use this syntax:

testSeqs[testRanges]

ADD COMMENT • link 7.0 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Thanks Michael, that is exactly what is was looking for and could not find on any of the help files, (or I overlooked it).

Since this syntax is not obvious to me, do you mind explaining what is going on here. Suppose I want to isolate one subsequence from a DNAStringSet, I can put in a GRanges, but how can I do this using more low-level objects. For e.g. dataframes I do:

x[row, column]

So for DNAStringSets it must be somethings like:

testSeqs["seq1", 10, 5]

to achieve the same thing as the first range in the example above.

Best Regards,

Stefan

ADD REPLY • link 7.0 years ago reuscher.stefan • 0

0

Entering edit mode

A DNAStringSet is one dimensional, so it should only take a single subscript. As a list, it is also hierarchical. The GRanges subscripting is essentially a shortcut for a "nested" subscript (like looping x[seqnames], extracting the corresponding range).

ADD REPLY • link 7.0 years ago Michael Lawrence ★ 11k

0

Entering edit mode

A hierarchical, one-dimensional subscript. That's exactly what I need to know to. I will figure it out from here.

Thanks for your great advice.

ADD REPLY • link 7.0 years ago reuscher.stefan • 0