What is the best way to subset a DNAStringSet by a GRange ?
1
0
Entering edit mode
@reuscherstefan-12824
Last seen 7.0 years ago

Hi All,

I recently had some task which I thought should be easy to accomplish. I wanted to subset a DNAStringSet (a whole genome) using the information in a GRanges object (output of tileGenome()). I searched quite a bit in the help and here but could not find a "bioconductor" way of doing it and ended up for-looping through the ranges.

There is an exampe below.

My question is: Am I missing something or is there really no connection between GRanges and BioStrings ?

Best Regards,

Stefan

 

Reproducible example:

 

testRanges <- GRanges(seqnames = c("seq1","seq1","seq2"),
                    ranges = IRanges(start = c(10,20,10),
                                     width = c(5,5,5))
)

testSeqs <- DNAStringSet(c("tagtagctagctacgatgcatgctacgatgcatgc",
                           "tatgctagctacgtacgtacgatcgtacgtagcta")
)
names(testSeqs) <- c("seq1","seq2")

#what I want
#newTestSeqs <- subseq(testSeqs, by = testRanges)

#what I do
testMask <- data.frame(seq = as.character(seqnames(testRanges)),
                       start = start(testRanges),
                       end = end(testRanges))

newTestSeqs <- DNAStringSet()
for (i in 1:nrow(testMask)){

  temp <- testSeqs[testMask[i,"seq"]]

  temp <- subseq(temp,
                 start = testMask[i,"start"],
                 end = testMask[i,"end"])

  newTestSeqs <- c(newTestSeqs,temp)
}

 

bioconductor biostrings granges • 4.2k views
ADD COMMENT
2
Entering edit mode
@michael-lawrence-3846
Last seen 2.4 years ago
United States

subseq() acts like substring(): it is vectorized over its input (x), so the seqnames of the GRanges wouldn't apply. Instead, just use this syntax:

testSeqs[testRanges]
ADD COMMENT
0
Entering edit mode

Thanks Michael, that is exactly what is was looking for and could not find on any of the help files, (or I overlooked it).

Since this syntax is not obvious to me, do you mind explaining what is going on here. Suppose I want to isolate one subsequence from a DNAStringSet, I can put in a GRanges, but how can I do this using more low-level objects. For e.g. dataframes I do:

x[row, column]

So for DNAStringSets it must be somethings like:

testSeqs["seq1", 10, 5]

to achieve the same thing as the first range in the example above.

Best Regards,

 

Stefan

ADD REPLY
0
Entering edit mode

A DNAStringSet is one dimensional, so it should only take a single subscript. As a list, it is also hierarchical. The GRanges subscripting is essentially a shortcut for a "nested" subscript (like looping x[seqnames], extracting the corresponding range).

ADD REPLY
0
Entering edit mode

A hierarchical, one-dimensional subscript. That's exactly what I need to know to. I will figure it out from here.

Thanks for your great advice.

ADD REPLY

Login before adding your answer.

Traffic: 735 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6