Search
Question: What is the best way to subset a DNAStringSet by a GRange ?
0
gravatar for reuscher.stefan
7 months ago by
reuscher.stefan0 wrote:

Hi All,

I recently had some task which I thought should be easy to accomplish. I wanted to subset a DNAStringSet (a whole genome) using the information in a GRanges object (output of tileGenome()). I searched quite a bit in the help and here but could not find a "bioconductor" way of doing it and ended up for-looping through the ranges.

There is an exampe below.

My question is: Am I missing something or is there really no connection between GRanges and BioStrings ?

Best Regards,

Stefan

 

Reproducible example:

 

testRanges <- GRanges(seqnames = c("seq1","seq1","seq2"),
                    ranges = IRanges(start = c(10,20,10),
                                     width = c(5,5,5))
)

testSeqs <- DNAStringSet(c("tagtagctagctacgatgcatgctacgatgcatgc",
                           "tatgctagctacgtacgtacgatcgtacgtagcta")
)
names(testSeqs) <- c("seq1","seq2")

#what I want
#newTestSeqs <- subseq(testSeqs, by = testRanges)

#what I do
testMask <- data.frame(seq = as.character(seqnames(testRanges)),
                       start = start(testRanges),
                       end = end(testRanges))

newTestSeqs <- DNAStringSet()
for (i in 1:nrow(testMask)){

  temp <- testSeqs[testMask[i,"seq"]]

  temp <- subseq(temp,
                 start = testMask[i,"start"],
                 end = testMask[i,"end"])

  newTestSeqs <- c(newTestSeqs,temp)
}

 

ADD COMMENTlink modified 7 months ago by Michael Lawrence9.8k • written 7 months ago by reuscher.stefan0
2
gravatar for Michael Lawrence
7 months ago by
United States
Michael Lawrence9.8k wrote:

subseq() acts like substring(): it is vectorized over its input (x), so the seqnames of the GRanges wouldn't apply. Instead, just use this syntax:

testSeqs[testRanges]
ADD COMMENTlink written 7 months ago by Michael Lawrence9.8k

Thanks Michael, that is exactly what is was looking for and could not find on any of the help files, (or I overlooked it).

Since this syntax is not obvious to me, do you mind explaining what is going on here. Suppose I want to isolate one subsequence from a DNAStringSet, I can put in a GRanges, but how can I do this using more low-level objects. For e.g. dataframes I do:

x[row, column]

So for DNAStringSets it must be somethings like:

testSeqs["seq1", 10, 5]

to achieve the same thing as the first range in the example above.

Best Regards,

 

Stefan

ADD REPLYlink written 7 months ago by reuscher.stefan0

A DNAStringSet is one dimensional, so it should only take a single subscript. As a list, it is also hierarchical. The GRanges subscripting is essentially a shortcut for a "nested" subscript (like looping x[seqnames], extracting the corresponding range).

ADD REPLYlink written 7 months ago by Michael Lawrence9.8k

A hierarchical, one-dimensional subscript. That's exactly what I need to know to. I will figure it out from here.

Thanks for your great advice.

ADD REPLYlink written 7 months ago by reuscher.stefan0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 282 users visited in the last hour