Translation Error in predictCoding() for VariantAnnotation
1
0
Entering edit mode
creasyt • 0
@creasyt-8657
Last seen 6.6 years ago
United States

Hi,

I'm noticing what appears to be an incorrect translation of a codon with predictCoding(). It's somewhat difficult to see but below is output of all nonsynonymous amino acid changes. However, variant 3 at bp 1002 shows an M for TTG rather than an L. Is this a bug? Clearly TTG is an L in the standard codon table so I'm not sure how to resolve this (although it can be a start codon in some orgs).

Thanks,

-todd

Here is the code I used to generate a txdb object. This is a custom gff3 file and fasta file for an RSV virus.

txdb <- makeTxDbFromGFF(gff3.file, format="gff3")

coding <- predictCoding(vcf, txdb, seqSource=FaFile(fsa.file))

nonsyn <- coding[(mcols(coding)[,13] == "nonsynonymous")]

 

# GFF3 file

A_NLD_13_005275 manual  region  1       1725    .       +       .       ID=id0
A_NLD_13_005275 manual  gene    1       1725    .       +       .       ID=gene1;Name=F;gbkey=Gene;gene=F
A_NLD_13_005275 manual  mRNA    1       1725    .       +       .       ID=rna1;Parent=gene1;gbkey=mRNA;gene=F
A_NLD_13_005275 manual  exon    1       1725    .       +       .       ID=exon1;Parent=rna1;gbkey=mRNA;gene=F
A_NLD_13_005275 manual  CDS     1       1725    .       +       .       ID=cds1;Parent=rna1;gbkey=CDS;gene=F

 

# FASTA file

>A_NLD_13_005275
ATGGAGTTGCCAATCCTCAAAACAAATGCTATTACCACAATCCTTGCTGCAGTCACACTCTGTTTCGCTTCCAGTCAAAACATCACTGAAGAATTTTATCAATCAACATGCAGTGCAGTTAGCAAAGGCTATCTTAGTGCTCTAAGAACTGGTTGGTATACTAGTGTTATAACTATAGAATTAAGTAATATCAAGGAAAATAAGTGTAATGGTACAGACGCTAAGGTAAAATTAATAAAACAAGAATTAGATAAATATAAAAATGCTGTAACAGAATTGCAGTTGCTCATGCAAAGCACACCAGCAGCCAACAGTCGAGCCAGAAGAGAACTACCAAGATTTATGAATTATACACTCAACAATACCAAAAACACCAATGTAACATTAAGTAAGAAAAGGAAAAGAAGATTTCTTGGATTTTTGTTAGGTGTTGGATCTGCAATCGCCAGTGGCATTGCCGTATCCAAGGTCCTGCACCTAGAAGGGGAAGTGAACAAAATCAAAAGTGCTCTACTATCCACAAACAAGGCTGTAGTCAGCTTATCTAATGGAGTCAGTGTCTTAACCAGCAAGGTGTTAGACCTCAAAAACTATATAGATAAACAGTTGTTACCTATTGTTAACAAGCAAAGCTGCAGCATATCAAACATTGAAACTGTGATAGAGTTCCAACAAAAGAACAACAGACTACTAGAGATTACCAGAGAATTTAGTGTTAATGCAGGTGTAACTACACCTGTAAGCACTTATATGTTAACTAATAGTGAGTTATTATCATTAATCAATGATATGCCTATAACAAATGATCAGAAAAAGTTAATGTCCAGCAATGTTCAAATAGTTAGACAGCAAAGTTACTCTATCATGTCAATAATAAAAGAGGAAGTCTTAGCATATGTAGTACAATTACCACTATATGGTGTAATAGATACTCCTTGTTGGAAACTACACACATCCCCTCTATGTACAACCAACACAAAGGAAGGATCCAACATCTGCTTGACAAGAACCGACAGAGGATGGTACTGTGACAATGCAGGATCAGTATCCTTTTTCCCACAAGCTGAAACATGTAAAGTTCAATCGAATCGGGTGTTTTGTGACACAATGAACAGTTTAACATTACCAAGTGAGGTAAATCTCTGCAACATTGACATATTCAACCCCAAATATGATTGCAAAATTATGACTTCAAAAACAGATGTAAGCAGCTCCGTTATCACATCTCTAGGAGCCATTGTGTCATGCTATGGCAAAACCAAATGTACAGCATCCAATAAAAATCGTGGGATCATAAAGACATTCTCTAACGGGTGTGATTATGTATCAAATAAGGGGGTGGATACTGTGTCTGTAGGTAATACATTATATTATGTAAATAAGCAAGAAGGCAAAAGTCTCTATGTAAAAGGTGAACCAATAATAAATTTCTATGATCCATTAGTGTTCCCCTCTGATGAATTTGATGCATCAATATCTCAAGTCAATGAGAAAATTAATCAGAGTCTAGCATTTATCCGTAAATCAGATGAATTATTACATAATGTAAATGCTGGTAAATCCACCACAAATATCATGATAACTACCATAATTATAGTAATTATAGTAATATTGTTAGCATTAATTGCAGTTGGACTGCTTCTATACTGCAAGGCCAGAAGCACACCAGTCACATTAAGTAAGGATCAACTGAGTGGTATAAATAATATTGCATTTAGTAACTGA

 

GRanges object with 5 ranges and 17 metadata columns:
                                  seqnames       ranges strand | paramRangeID
                                     <Rle>    <IRanges>  <Rle> |     <factor>
    A_NLD_13_005275:11_C/T A_NLD_13_005275 [  11,   11]      + |         <NA>
    A_NLD_13_005275:64_T/A A_NLD_13_005275 [  64,   64]      + |         <NA>
  A_NLD_13_005275:1002_G/A A_NLD_13_005275 [1002, 1002]      + |         <NA>
  A_NLD_13_005275:1541_A/T A_NLD_13_005275 [1541, 1541]      + |         <NA>
  A_NLD_13_005275:1552_G/T A_NLD_13_005275 [1552, 1552]      + |         <NA>
                                      REF                ALT      QUAL
                           <DNAStringSet> <DNAStringSetList> <numeric>
    A_NLD_13_005275:11_C/T              C                  T      <NA>
    A_NLD_13_005275:64_T/A              T                  A      <NA>
  A_NLD_13_005275:1002_G/A              G                  A      <NA>
  A_NLD_13_005275:1541_A/T              A                  T      <NA>
  A_NLD_13_005275:1552_G/T              G                  T      <NA>
                                FILTER      varAllele       CDSLOC
                           <character> <DNAStringSet>    <IRanges>
    A_NLD_13_005275:11_C/T        PASS              T [  11,   11]
    A_NLD_13_005275:64_T/A        PASS              A [  64,   64]
  A_NLD_13_005275:1002_G/A        PASS              A [1002, 1002]
  A_NLD_13_005275:1541_A/T        PASS              T [1541, 1541]
  A_NLD_13_005275:1552_G/T        PASS              T [1552, 1552]
                              PROTEINLOC   QUERYID        TXID         CDSID
                           <IntegerList> <integer> <character> <IntegerList>
    A_NLD_13_005275:11_C/T             4         1           1             1
    A_NLD_13_005275:64_T/A            22         3           1             1
  A_NLD_13_005275:1002_G/A           334         6           1             1
  A_NLD_13_005275:1541_A/T           514        12           1             1
  A_NLD_13_005275:1552_G/T           518        13           1             1
                                GENEID   CONSEQUENCE       REFCODON
                           <character>      <factor> <DNAStringSet>
    A_NLD_13_005275:11_C/T           F nonsynonymous            CCA
    A_NLD_13_005275:64_T/A           F nonsynonymous            TTC
  A_NLD_13_005275:1002_G/A           F nonsynonymous            TTG
  A_NLD_13_005275:1541_A/T           F nonsynonymous            CAT
  A_NLD_13_005275:1552_G/T           F nonsynonymous            GCT
                                 VARCODON         REFAA         VARAA
                           <DNAStringSet> <AAStringSet> <AAStringSet>
    A_NLD_13_005275:11_C/T            CTA             P             L
    A_NLD_13_005275:64_T/A            ATC             F             I
  A_NLD_13_005275:1002_G/A            TTA             M             L
  A_NLD_13_005275:1541_A/T            CTT             H             L
  A_NLD_13_005275:1552_G/T            TCT             A             S
  -------

 

sessionInfo():

R version 3.4.1 (2017-06-30)
Platform: x86_64-apple-darwin16.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libLAPACK.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  methods   stats     graphics  grDevices utils
[8] datasets  base

other attached packages:
 [1] GenomicFeatures_1.28.4     AnnotationDbi_1.38.1
 [3] dplyr_0.7.2                plyr_1.8.4
 [5] gridExtra_2.2.1            ggplot2_2.2.1
 [7] reshape2_1.4.2             VariantAnnotation_1.22.3
 [9] Rsamtools_1.28.0           Biostrings_2.44.2
[11] XVector_0.16.0             SummarizedExperiment_1.6.3
[13] DelayedArray_0.2.7         matrixStats_0.52.2
[15] Biobase_2.36.2             GenomicRanges_1.28.4
[17] GenomeInfoDb_1.12.2        IRanges_2.10.2
[19] S4Vectors_0.14.3           BiocGenerics_0.22.0

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.12             bindr_0.1                compiler_3.4.1
 [4] bitops_1.0-6             tools_3.4.1              zlibbioc_1.22.0
 [7] biomaRt_2.32.1           digest_0.6.12            bit_1.1-12
[10] gtable_0.2.0             RSQLite_2.0              memoise_1.1.0
[13] tibble_1.3.3             lattice_0.20-35          BSgenome_1.44.0
[16] pkgconfig_2.0.1          rlang_0.1.1              Matrix_1.2-10
[19] DBI_0.7                  bindrcpp_0.2             GenomeInfoDbData_0.99.0
[22] stringr_1.2.0            rtracklayer_1.36.4       bit64_0.9-7
[25] grid_3.4.1               glue_1.1.1               R6_2.2.2
[28] XML_3.98-1.9             BiocParallel_1.10.1      magrittr_1.5
[31] blob_1.1.0               scales_0.4.1             GenomicAlignments_1.12.1
[34] assertthat_0.2.0         colorspace_1.3-2         stringi_1.1.5
[37] lazyeval_0.2.0           munsell_0.4.3            RCurl_1.95-4.8
VariantAnnotation • 1.6k views
ADD COMMENT
0
Entering edit mode

I'm not sure why we see '**TTG**' for REFCODON - that's where the trouble starts. I can't try to reproduce this unless I know what annotation you used. Please edit your answer to include the annotation and the output of sessionInfo().

Valerie

ADD REPLY
0
Entering edit mode

Sorry, the "**TTG**" was a mistake. I removed them from the post. I've added some code that I used to generate the custom txdb object which is for a RSV virus, the gff3 data, the fasta sequence, and the sessionInfo(). Let me know what else you need.

ADD REPLY
0
Entering edit mode
@herve-pages-1542
Last seen 3 days ago
Seattle, WA, United States

Hi todd,

From the Examples section in ?GENETIC_CODE:

  Codons TTG and CTG are "normally" translated to L except when they are
  the first translated codon (a.k.a. start codon or initiation codon),
  in which case they are translated to M. 

The alt_init_codons attribute of GENETIC_CODE lists the alternative initiation codons:

> GENETIC_CODE
TTT TTC TTA TTG TCT TCC TCA TCG TAT TAC TAA TAG TGT TGC TGA TGG 
"F" "F" "L" "L" "S" "S" "S" "S" "Y" "Y" "*" "*" "C" "C" "*" "W" 
CTT CTC CTA CTG CCT CCC CCA CCG CAT CAC CAA CAG CGT CGC CGA CGG 
"L" "L" "L" "L" "P" "P" "P" "P" "H" "H" "Q" "Q" "R" "R" "R" "R" 
ATT ATC ATA ATG ACT ACC ACA ACG AAT AAC AAA AAG AGT AGC AGA AGG 
"I" "I" "I" "M" "T" "T" "T" "T" "N" "N" "K" "K" "S" "S" "R" "R" 
GTT GTC GTA GTG GCT GCC GCA GCG GAT GAC GAA GAG GGT GGC GGA GGG 
"V" "V" "V" "V" "A" "A" "A" "A" "D" "D" "E" "E" "G" "G" "G" "G" 
attr(,"alt_init_codons")
[1] "TTG" "CTG"

See ?GENETIC_CODE and ?translate in the Biostrings package for more information and many examples.

Cheers,

H.

ADD COMMENT
0
Entering edit mode

Hi Herve,

I saw that, too. However, these are not start codons which is why I brought it up. Plus, I think that setting GENETIC_CODE is only to translate the nucleotide sequence. I *think* the codons are translated in another function later on in a separate function. I noticed that the only time say TTG is coded to a "M" is when it's in the REF position. Never in the ALT position. 

If Valerie can't reproduce, then perhaps there's something I'm doing wrong but I'm pretty sure I'm running it properly. I'll just keep a close eye on the output and make sure it's coding right.

Thanks,

-todd

ADD REPLY

Login before adding your answer.

Traffic: 556 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6