Search
Question: What is the best way to subset a DNAStringSet by a GRange ?
0
gravatar for reuscher.stefan
20 months ago by
reuscher.stefan0 wrote:

Hi All,

I recently had some task which I thought should be easy to accomplish. I wanted to subset a DNAStringSet (a whole genome) using the information in a GRanges object (output of tileGenome()). I searched quite a bit in the help and here but could not find a "bioconductor" way of doing it and ended up for-looping through the ranges.

There is an exampe below.

My question is: Am I missing something or is there really no connection between GRanges and BioStrings ?

Best Regards,

Stefan

 

Reproducible example:

 

testRanges <- GRanges(seqnames = c("seq1","seq1","seq2"),
                    ranges = IRanges(start = c(10,20,10),
                                     width = c(5,5,5))
)

testSeqs <- DNAStringSet(c("tagtagctagctacgatgcatgctacgatgcatgc",
                           "tatgctagctacgtacgtacgatcgtacgtagcta")
)
names(testSeqs) <- c("seq1","seq2")

#what I want
#newTestSeqs <- subseq(testSeqs, by = testRanges)

#what I do
testMask <- data.frame(seq = as.character(seqnames(testRanges)),
                       start = start(testRanges),
                       end = end(testRanges))

newTestSeqs <- DNAStringSet()
for (i in 1:nrow(testMask)){

  temp <- testSeqs[testMask[i,"seq"]]

  temp <- subseq(temp,
                 start = testMask[i,"start"],
                 end = testMask[i,"end"])

  newTestSeqs <- c(newTestSeqs,temp)
}

 

ADD COMMENTlink modified 20 months ago by Michael Lawrence10k • written 20 months ago by reuscher.stefan0
2
gravatar for Michael Lawrence
20 months ago by
United States
Michael Lawrence10k wrote:

subseq() acts like substring(): it is vectorized over its input (x), so the seqnames of the GRanges wouldn't apply. Instead, just use this syntax:

testSeqs[testRanges]
ADD COMMENTlink written 20 months ago by Michael Lawrence10k

Thanks Michael, that is exactly what is was looking for and could not find on any of the help files, (or I overlooked it).

Since this syntax is not obvious to me, do you mind explaining what is going on here. Suppose I want to isolate one subsequence from a DNAStringSet, I can put in a GRanges, but how can I do this using more low-level objects. For e.g. dataframes I do:

x[row, column]

So for DNAStringSets it must be somethings like:

testSeqs["seq1", 10, 5]

to achieve the same thing as the first range in the example above.

Best Regards,

 

Stefan

ADD REPLYlink written 20 months ago by reuscher.stefan0

A DNAStringSet is one dimensional, so it should only take a single subscript. As a list, it is also hierarchical. The GRanges subscripting is essentially a shortcut for a "nested" subscript (like looping x[seqnames], extracting the corresponding range).

ADD REPLYlink written 20 months ago by Michael Lawrence10k

A hierarchical, one-dimensional subscript. That's exactly what I need to know to. I will figure it out from here.

Thanks for your great advice.

ADD REPLYlink written 20 months ago by reuscher.stefan0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 347 users visited in the last hour