Search
Question: What is the best way to subset a DNAStringSet by a GRange ?
0
14 months ago by
reuscher.stefan0 wrote:

Hi All,

I recently had some task which I thought should be easy to accomplish. I wanted to subset a DNAStringSet (a whole genome) using the information in a GRanges object (output of tileGenome()). I searched quite a bit in the help and here but could not find a "bioconductor" way of doing it and ended up for-looping through the ranges.

There is an exampe below.

My question is: Am I missing something or is there really no connection between GRanges and BioStrings ?

Best Regards,

Stefan

Reproducible example:

testRanges <- GRanges(seqnames = c("seq1","seq1","seq2"),
ranges = IRanges(start = c(10,20,10),
width = c(5,5,5))
)

testSeqs <- DNAStringSet(c("tagtagctagctacgatgcatgctacgatgcatgc",
"tatgctagctacgtacgtacgatcgtacgtagcta")
)
names(testSeqs) <- c("seq1","seq2")

#what I want
#newTestSeqs <- subseq(testSeqs, by = testRanges)

#what I do
start = start(testRanges),
end = end(testRanges))

newTestSeqs <- DNAStringSet()

temp <- subseq(temp,

newTestSeqs <- c(newTestSeqs,temp)
}

modified 14 months ago by Michael Lawrence10.0k • written 14 months ago by reuscher.stefan0
2
14 months ago by
Michael Lawrence10.0k
United States
Michael Lawrence10.0k wrote:

subseq() acts like substring(): it is vectorized over its input (x), so the seqnames of the GRanges wouldn't apply. Instead, just use this syntax:

testSeqs[testRanges]

Thanks Michael, that is exactly what is was looking for and could not find on any of the help files, (or I overlooked it).

Since this syntax is not obvious to me, do you mind explaining what is going on here. Suppose I want to isolate one subsequence from a DNAStringSet, I can put in a GRanges, but how can I do this using more low-level objects. For e.g. dataframes I do:

x[row, column]

So for DNAStringSets it must be somethings like:

testSeqs["seq1", 10, 5]

to achieve the same thing as the first range in the example above.

Best Regards,

Stefan

A DNAStringSet is one dimensional, so it should only take a single subscript. As a list, it is also hierarchical. The GRanges subscripting is essentially a shortcut for a "nested" subscript (like looping x[seqnames], extracting the corresponding range).

A hierarchical, one-dimensional subscript. That's exactly what I need to know to. I will figure it out from here.