Question

How to find the amino acid codon corresponding to a translated genomic position

0

Entering edit mode

madsheilskov ▴ 10

@madsheilskov-9317

Last seen 3.6 years ago

Denmark

library(GenomicRanges)

gr <-
GRanges(
  Rle(c('chr17', 'chr5')),
  IRanges(c(7573997,112173451),c(7573997,112173451))
  )

If I want to get the hg19 transcripts associated with these ranges I can do for instance:

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
tx <- TxDb.Hsapiens.UCSC.hg19.knownGene

transcriptsByOverlaps(tx, gr, columns=c("tx_name", "CDSID"))
GRanges object with 17 ranges and 2 metadata columns:
       seqnames                 ranges strand   |     tx_name                 CDSID
          <Rle>              <IRanges>  <Rle>   | <character>         <IntegerList>
   [1]     chr5 [112043202, 112181936]      +   |  uc011cvt.2 65327,65329,65330,...
   [2]     chr5 [112073556, 112181936]      +   |  uc003kpy.4    NA,65328,65329,...
   [3]     chr5 [112073556, 112181936]      +   |  uc003kpz.4       NA,NA,65328,...
   [4]     chr5 [112073556, 112181936]      +   |  uc010jbz.3          NA,NA,NA,...
   [5]     chr5 [112157593, 112181936]      +   |  uc010jca.3           NA,NA,65344
   ...      ...                    ...    ... ...         ...                   ...
  [13]    chr17     [7571720, 7590868]      -   |  uc002gij.3          NA,NA,NA,...
  [14]    chr17     [7571720, 7590868]      -   |  uc002gim.3  NA,183308,183307,...
  [15]    chr17     [7571720, 7590868]      -   |  uc010cnh.2  NA,183308,183307,...
  [16]    chr17     [7571720, 7590868]      -   |  uc010cni.2  NA,183308,183307,...
  [17]    chr17     [7571720, 7590868]      -   |  uc031qyq.1      NA,NA,183305,...
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

But I can not figure out how to get the amino acids codons at precisely these positions?

genomicranges genomicfeatures translation • 3.2k views

ADD COMMENT • link updated 12 weeks ago by russ.bainer ▴ 10 • written 8.4 years ago by madsheilskov ▴ 10

1

Entering edit mode

madsheilskov ▴ 10

@madsheilskov-9317

Last seen 3.6 years ago

Denmark

Yeah, a complicated process, but it worked perfectly. I think it would be nice to have an easier method implemented alongside the other nice methods in GenomicRanges and GenomicFeature packages; It is not completely uncommon to search for the amino acid(s) corresponding to a certain position (or range).

Anyway, this is what I did (note used the dplyr package for %>% piping below):

#My data (`gr` in the question) is like (with one annotation mcolumn).

> head(gr)
GRanges object with 6 ranges and 1 metadata column:
      seqnames               ranges strand |  id_tregion
         <Rle>            <IRanges>  <Rle> | <character>
  [1]    chr17 [ 7573837,  7573837]      * |       TR001
  [2]    chr17 [ 7576853,  7576853]      * |       TR002
  [3]    chr17 [ 7577509,  7577509]      * |       TR003


# First I get the genomic coordinates of all coding sequences in tx (hg19):

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
tx <- TxDb.Hsapiens.UCSC.hg19.knownGene
cds <- cdsBy(tx) 


#Then, use mapToTranscripts() to get a Hits-like object with which I can find the cds sequences:

my_hits <- mapToTranscripts(gr, cds)


#Find the cds sequence coordinates using using subjectHits - column 2 - as indeces in cdc:

my_cds <- cdc[mcols(my_hits)[[2]]]


#Get the actual coding sequence from the coordinates:

library(BSgenome.Hsapiens.UCSC.hg19)
BShg19 <- BSgenome.Hsapiens.UCSC.hg19
my_seq <- extractTranscriptSeqs(BShg19, my_cdc)


#The first complete coding sequence (starting with ATG) is:
> my_seq[1]
  A DNAStringSet instance of length 1
    width seq                                                                                                        names          
[1]   858 ATGGAGGAGCCGCAGTCAGATCCTAGCGTCGAGCCCCCTCTGAGTCAGGAAA...TCAGGAGTTCAAGACAGCCTGGCCAACATAGCGAAATCCCATCTCTACTAA 62635


#And the corresponding translated amino acids sequences are:



my_aa <- translate(my_seq)


#The position in the sequences is given by the range in `my_hits`:
bp <- start(my_hits) #in base pair while bp %/% 3 gives the codon position.

#Hence, the individual amino acid and codon number my initial query positions are:

aa <- subseq(my_aa, bp %/% 3, bp %/% 3 ) %>% as.vector()
codon_no <- bp %/% 3

Since I got several 1-to-many hits in the mapToTranscripts() method above I kept all the possible hits in a data.frame for furture filtering, and added some annotations as well:

TXID_hits <- as.character(as.data.frame(my_hits)$seqnames)
library(Homo.sapiens)
annotation <-  select(Homo.sapiens, keys=TXID_hits, columns=c('SYMBOL','TXNAME'),
       keytype = 'TXID' )

Final <- gr[mcols(my_hits)[[1]]] 
Final <- as.data.frame(Final) %>%  cbind(annotation, codon_no, aa)

ADD COMMENT • link 8.4 years ago madsheilskov ▴ 10

0

Entering edit mode

Couple of minor points:

Are you aware that you can write seqnames(my_hits) instead of coercing to a data.frame?
I think we can make select() accept an Rle so the character coercion is unnecessary.
The last line could be written as data.frame(Final, annotation, codon_no, aa), which is perhaps easier to read, but you might consider adding the last three columns to the mcols(Final), so that the GRanges is preserved. You can always convert to data.frame later when absolutely needed.

ADD REPLY • link 8.4 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Sure - thanks for the comments (many of the Ranges methods are still somewhat new to me and I need to get used using them).

ADD REPLY • link 8.4 years ago madsheilskov ▴ 10

0

Entering edit mode

Thanks so much for this extensive reply.

For people continuing to struggle with this topic, the ensembldb package has some nice functionality for this:

> library(EnsDb.Hsapiens.v86)
> edb <- EnsDb.Hsapiens.v86
> 
> gr
GRanges object with 2 ranges and 0 metadata columns:
            seqnames              ranges strand
               <Rle>           <IRanges>  <Rle>
  region_20        2 113246754-113246919      -
  region_21        2 113244427-113244624      -
  -------
  seqinfo: 25 sequences (1 circular) from an unspecified genome

> prot_map <- genomeToProtein(gr, edb)
Warning message:
Transcript(s) 'ENST00000467778' do/does not encode a protein 
> 
> prot_seq <- proteins(edb, 
                      filter = ProteinIdFilter(unique(prot_map@unlistData@NAMES)),
                      return.type = "AAStringSet")
>  
> prot_seq
AAStringSet object of length 5:
    width seq                                                                                                                                       names               
[1]   450 MPHNSIRSGHGGLNQLGGAFVNGRPLPEVVRQRIVDLAHQGVRPCDISRQLRVSHGCVSKILGRYYE...GSYASSAIAGMVAGSEYSGNAYGHTPYSSYSEAWRFPNSSLLSSPYYYSSTSRPSAPPTTATAFDHL ENSP00000263334
[2]   321 MPHNSIRSGHGGLNQLGGAFVNGRPLPEVVRQRIVDLAHQGVRPCDISRQLRVSHGCVSKILGRYYE...TKGEQGERWWGPRCPDTHPTSPPADRAAMPPLPSQAWWQEVNTLAMPMATPPTPPTARPGASPTPAC ENSP00000263335
[3]   398 MPHNSIRSGHGGLNQLGGAFVNGRPLPEVVRQRIVDLAHQGVRPCDISRQLRVSHGCVSKILGRYYE...RPSSQGERWWGPRCPDTHPTSPPADRAAMPPLPSQAWWQEVNTLAMPMATPPTPPTARPGASPTPAC ENSP00000314750
[4]   287 MPHNSIRSGHGGLNQLGGAFVNGRPLPEVVRQRIVDLAHQGVRPCDISRQLRVSHGCVSKILGRYYE...KHLRTDAFSQHHLEPLECPFERQHYPEAYASPSHTKGEQEVNTLAMPMATPPTPPTARPGASPTPAC ENSP00000380768
[5]   450 MPHNSIRSGHGGLNQLGGAFVNGRPLPEVVRQRIVDLAHQGVRPCDISRQLRVSHGCVSKILGRYYE...GSYASSAIAGMVAGSEYSGNAYGHTPYSSYSEAWRFPNSSLLSSPYYYSSTSRPSAPPTTATAFDHL ENSP00000395498

> prot_map
IRangesList object of length 2:
[[1]]
IRanges object with 5 ranges and 8 metadata columns:
                      start       end     width |           tx_id    cds_ok         exon_id exon_rank seq_start   seq_end    seq_name  seq_strand
                  <integer> <integer> <integer> |     <character> <logical>     <character> <integer> <integer> <integer> <character> <character>
  ENSP00000263334         9        64        56 | ENST00000263334      TRUE ENSE00003489433         3 113246754 113246919           2           -
  ENSP00000263335         9        64        56 | ENST00000263335      TRUE ENSE00003489433         3 113246754 113246919           2           -
  ENSP00000314750         9        64        56 | ENST00000348715      TRUE ENSE00003489433         3 113246754 113246919           2           -
  ENSP00000380768         9        64        56 | ENST00000397647      TRUE ENSE00003489433         3 113246754 113246919           2           -
  ENSP00000395498         9        64        56 | ENST00000429538      TRUE ENSE00003489433         3 113246754 113246919           2           -

[[2]]
IRanges object with 5 ranges and 8 metadata columns:
                      start       end     width |           tx_id    cds_ok         exon_id exon_rank seq_start   seq_end    seq_name  seq_strand
                  <integer> <integer> <integer> |     <character> <logical>     <character> <integer> <integer> <integer> <character> <character>
  ENSP00000263334        64       130        67 | ENST00000263334      TRUE ENSE00003582162         4 113244427 113244624           2           -
  ENSP00000263335        64       130        67 | ENST00000263335      TRUE ENSE00003582162         4 113244427 113244624           2           -
  ENSP00000314750        64       130        67 | ENST00000348715      TRUE ENSE00003582162         4 113244427 113244624           2           -
  ENSP00000380768        64       130        67 | ENST00000397647      TRUE ENSE00003582162         4 113244427 113244624           2           -
  ENSP00000395498        64       130        67 | ENST00000429538      TRUE ENSE00003582162         4 113244427 113244624           2           -

To comment on the above, I believe that the transcript is represented in cds as a GRangesList object in which each GRanges is composed of a set of ranges corresponding to the transcript exons. In that case, I think that the base pair positions identified in my_hits would be exon-relative positions, which would need to be assembled and translated into transcript-relative positions if the transcript has more than one exon.

ADD REPLY • link 12 weeks ago russ.bainer ▴ 10

score 3 · Accepted Answer · 2015-12-04

This would be a complex process. First, use cdsBy to retrieve the complete transcript structure and map the positions into the CDS coordinates (mapToTranscripts). Extract the spliced sequences of the hits (extractTranscriptSeqs), map the CDS coordinates to codon ranges (math with %%3), and finally extract the codon sequences (extractAt).

What are you trying to do with these codons? Are you looking at coding effects of variants? Perhaps VariantAnnotation::predictCoding would be a lot easier.