library(GenomicRanges) gr <- GRanges( Rle(c('chr17', 'chr5')), IRanges(c(7573997,112173451),c(7573997,112173451)) )
If I want to get the hg19 transcripts associated with these ranges I can do for instance:
library(TxDb.Hsapiens.UCSC.hg19.knownGene) tx <- TxDb.Hsapiens.UCSC.hg19.knownGene transcriptsByOverlaps(tx, gr, columns=c("tx_name", "CDSID")) GRanges object with 17 ranges and 2 metadata columns: seqnames ranges strand | tx_name CDSID <Rle> <IRanges> <Rle> | <character> <IntegerList> [1] chr5 [112043202, 112181936] + | uc011cvt.2 65327,65329,65330,... [2] chr5 [112073556, 112181936] + | uc003kpy.4 NA,65328,65329,... [3] chr5 [112073556, 112181936] + | uc003kpz.4 NA,NA,65328,... [4] chr5 [112073556, 112181936] + | uc010jbz.3 NA,NA,NA,... [5] chr5 [112157593, 112181936] + | uc010jca.3 NA,NA,65344 ... ... ... ... ... ... ... [13] chr17 [7571720, 7590868] - | uc002gij.3 NA,NA,NA,... [14] chr17 [7571720, 7590868] - | uc002gim.3 NA,183308,183307,... [15] chr17 [7571720, 7590868] - | uc010cnh.2 NA,183308,183307,... [16] chr17 [7571720, 7590868] - | uc010cni.2 NA,183308,183307,... [17] chr17 [7571720, 7590868] - | uc031qyq.1 NA,NA,183305,... ------- seqinfo: 93 sequences (1 circular) from hg19 genome
But I can not figure out how to get the amino acids codons at precisely these positions?
Couple of minor points:
seqnames(my_hits)
instead of coercing to a data.frame?select()
accept an Rle so the character coercion is unnecessary.data.frame(Final, annotation, codon_no, aa)
, which is perhaps easier to read, but you might consider adding the last three columns to themcols(Final)
, so that the GRanges is preserved. You can always convert to data.frame later when absolutely needed.Sure - thanks for the comments (many of the Ranges methods are still somewhat new to me and I need to get used using them).
Thanks so much for this extensive reply.
For people continuing to struggle with this topic, the
ensembldb
package has some nice functionality for this:To comment on the above, I believe that the transcript is represented in
cds
as aGRangesList
object in which eachGRanges
is composed of a set of ranges corresponding to the transcript exons. In that case, I think that the base pair positions identified inmy_hits
would be exon-relative positions, which would need to be assembled and translated into transcript-relative positions if the transcript has more than one exon.