Search
Question: How to find the amino acid codons corresponding to a subset of a range of genomic positions
0
2.5 years ago by
Denmark

Recently I asked how one - if given some genomic positions as a GRanges object - could find the amino acid codons that corresponded to these positions (if in a coding exon): C: How to find the amino acid codon corresponding to a translated genomic position . Got a nice solution from the site.

I now find myself in a similar but somewhat extended (complicated ?) situation. I have a GRanges with ~1000 > width > 330  bases. Parts of these ranges overlap with coding exons and I wish to know what are the corresponding amino acid sequences in these overlaps. The catch is that part of each range - since they are genomic - may also overlap with intron or untranslated parts of exons before they 'continue' into some translated part. tried to fiddle around the solution I got in the answer to the above question/link, but couldnt figure it out. How can I find the amino acids "overlapping" these ranges ?

modified 2.5 years ago by Michael Lawrence10.0k • written 2.5 years ago by madsheilskov0
1
2.5 years ago by
Michael Lawrence10.0k
United States
Michael Lawrence10.0k wrote:

A first step would probably be to match the ranges to overlapping transcripts (cds) with findOverlaps(). Then, subscript with the hits to line up the ranges with the transcripts and find the portion of each range within the matched transcript with pintersect(). Use pmapToTranscript() to map coordinates, then proceed as before.

I get the idea, but is stucked after pintersect(). This returns a GRangesList object with the coordinates of the overlaps with transcritps exons and includes a boolean hit mcols. How do I subset this object's TRUE values in order to get the genomic positions (of cds transcripts) with overlap to my regions. I gues this is my next step. The object is like:

GRangesList object of length 249:
\$62638
GRanges object with 7 ranges and 4 metadata columns:
seqnames             ranges strand |    cds_id    cds_name exon_rank       hit
<Rle>          <IRanges>  <Rle> | <integer> <character> <integer> <logical>
[1]    chr17 [7578371, 7578370]      - |    183302        <NA>         1     FALSE
[2]    chr17 [7578177, 7578176]      - |    183301        <NA>         2     FALSE
[3]    chr17 [7577499, 7577498]      - |    183298        <NA>         3     FALSE
[4]    chr17 [7577019, 7577018]      - |    183297        <NA>         4     FALSE
[5]    chr17 [7576853, 7576852]      - |    183296        <NA>         5     FALSE
[6]    chr17 [7573927, 7574033]      - |    183293        <NA>         6      TRUE
[7]    chr17 [7572927, 7572926]      - |    183292        <NA>         7     FALSE

Pass drop.nohit.ranges=TRUE to pintersect().

Smart, and unlist() afterwards gets you a nice GRanges object. Anyway, do you extract/access individual elements of a GRangeList object; would it be something along lapply perhaps ?

I don't think you want to unlist, at least not until after the pmapToTranscript() call. While lapply() does of course work (GRangesList is essentially a list), it is typically not the most convenient, nor the most performant, way to things.

This will not work as pmapToTranscripts() can not take two GrangesLists, namely the one from my pintersect() and the other being cds transcripts from cdsBy().

Perhaps I should not use cdsBy() here (that would intuitively make sense though) ?

Yet, maybe I should not have used unlist() as I now more ranges than the initial 249 list elements since some of my regions ranges overlaps multiple exons. Hence I loose the connection to my query (the ranges) and cannot use  pmapToTranscript().

So in the end, I seem to be stucked anyway ?

I use pmapToTranscripts() with the intersect coordinates and the GrangesList of cds transcripts from cdsBy() which does not work as both are List objects.

1

Basic strategy is

grl_tx <- pmapToTranscripts(unlist(grl), cds[togroup(grl)])

gr <- unlist(reduce(relist(grl_tx, grl)))

But I will add a method to pmapToTranscripts that does that.

Do you mean (following the ideas above):

grl <- pintersect(tr_hits, tx_hits, drop.nohit.ranges=TRUE )

Here tr_hits and tx_hits are output from findOverlaps() as suggested. Then,

grl_tx <- pmapToTranscripts(unlist(grl), cds[togroup(grl)])

does not work since cds[togroup(grl)] only subsets the first few transcripts of cds (togroup(grl) gives a vector 1:length(grl) ) and cds is the collection of hg19 cds transcripts. I must have misunderstood something here.

I was just writing the rough algorithm. In that example, you want to replace cds with tx_hits.