Question

Reverse translate from protein sequences into the most likely non-degenerate DNA coding sequence


moldach ▴ 20

@moldach-8829

Last seen 5.6 years ago

Canada/Montreal/Douglas Mental Health I…

I have saliva-derived WGS data from which I'm trying to remove all non-human contamination. Two tools that I've found for this are DeconSeq and DecontaMiner. Both tools require known reference genomes, from which you build a BWA database for alignment with BWA-SW.

To begin with, I used the Human Oral Microbiome Database (FASTA); however, my PI suggested a more exhaustive approach using the NCBI non-redundant (nr) protein database.

The nr database comes as a FASTA file; however, it contains protein sequences, not nucleotide sequences (which are required by DeconSeq/DecontaMiner). I know the Bioconductor package Biostrings contains the translate() function, which translates DNA into amino acid sequences, but it doesn't have a comparable function to reverse translate (back translate) from protein to DNA.

Is there an R package (or CLI tool) which can translate protein sequences into the most likely non-degenerate DNA coding sequence (for a 75 Gb file)?

Probably more importantly, I'm wondering if anyone sees any potential issue with this approach?

A particular protein follows from the translation of a DNA sequence, whereas the reverse translation need not have a unique solution under the genetic code. The genetic code is degenerate, which means that a particular amino acid can be encoded by more than one codon.
As my project is specifically interested in variant calling, could errors not be introduced by the ambiguities of reverse translation?
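To give a sense of how large that ambiguity is, here is a minimal Python sketch (not part of DeconSeq, DecontaMiner or Biostrings) that counts how many distinct DNA sequences encode a given peptide. It assumes the standard genetic code; the names `CODON_TABLE`, `CODONS_PER_AA` and `count_back_translations` are made up for illustration.

```python
from collections import Counter
from itertools import product
from math import prod

# Standard genetic code, codons enumerated in TCAG order (third base varies fastest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Number of synonymous codons available for each amino acid (and for stop, "*").
CODONS_PER_AA = Counter(CODON_TABLE.values())


def count_back_translations(peptide):
    """Return how many distinct DNA sequences translate to `peptide`."""
    return prod(CODONS_PER_AA[aa] for aa in peptide)


# First ten residues of the WP_137987990.1 record from nr.
print(count_back_translations("MTDQVRCAVL"))
```

The count is the product of the per-residue codon counts, so it grows multiplicatively with peptide length; picking a single "most likely" sequence means discarding all the others.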

I appreciate any feedback on this matter.

biostrings • 3.5k views

ADD COMMENT • link 5.7 years ago moldach ▴ 20


Are the nt FASTA files not acceptable? Seems like not reverse translating is simpler than doing so.

ADD REPLY • link 5.6 years ago James W. MacDonald 68k


A nucleotide FASTA is what DeconSeq (and DecontaMiner) accept (see: http://deconseq.sourceforge.net/manual.html).

The nr database comes as a FASTA file; however, it contains protein sequences, not nucleotide sequences (which are required by DeconSeq/DecontaMiner).

(base) [moldach@synergy ncbi-nr]$ head -50 nr.fa 
>EFG1759503.1 decarboxylating NADP(+)-dependent phosphogluconate dehydrogenase [Escherichia coli]
LKPYLDKGDIIIDGGNTFFQDTIRRNRELSAEGFNFIGTGVSGGEEGALKGPSIMPGGQKEAYELVAPILTKIAAVAEDG
EPCVTYIGADGAGHYVKMVHNGIEYGDMQLIAEAYSLLKGGLNLTNEELAQTFTEWNNGELSSYLIDITKDIFTKKDEDG
NYLVDVILDEAANKGTGKWTSQSALDLGEPLSLITESVFARYISSLKEQRVAASKVLSGPQAQPAGDKGEFIEKVRRALY
LGKIVSYAQGFSQLRAASEEYNWDLNYGEIAKIFRAGCIIRAQFLQKITDAYIENPQIANLLLAPYFKQIADNYQQALRE
VVAYAVQNGIPVPTFAAAVAYYDSYRAAVLPANLIQAQRDYFGAHTYKRIDKEGVFHTEWL
>WP_137987990.1 inositol 2-dehydrogenase [Bacillus velezensis]TKZ18939.1 inositol 2-dehydrogenase [Bacillus velezensis]
MTDQVRCAVLGLGRLGYFHAKHLVSEVRGAELAAVCDPMKGKAETCAKELGIAKWTENPYDLLEDHTIDAVIIVTPTSTH
AEMIMKAAENGKHIFVEKPLTLSLEESKEVMKKIEETGVICQVGFMRRFDPAYADAKRRIDAGEIGRPIYYKGFTRDQGA
PPAEFIKHSGGLFIDCSIHDYDIARYLMNAEVTSVCGHGRILKHPFMEECGDVDQALTYLEFDSGAAGDVEASRNSPYGH
DIRAEIIGTAGSILVGTLRKSHVTILTESGSSYEIIPDFQARFKDAYRLELEHFAECVKKGEMPIVTDVDATINLEIGIA
ATESFKTGRPVKLTPGAFGYAGL

ADD REPLY • link 5.6 years ago moldach ▴ 20


Hi Matthew,

I'm not sure what the most likely non-degenerate DNA coding sequence is, or how you're going to figure it out. As you say, there is more than one way to reverse translate. So after you've reverse translated the nr protein database and aligned your WGS data against it, I suspect that a lot of your data is not going to align even though it would translate into a protein sequence present in the nr protein database.

For example, suppose your data contains the sequence AGAAAGGCCAACCCT, which translates into RKANP, and the nr protein database does contain RKANP, but your reverse translation procedure has replaced it with AGGAAAGCTAATCCG. Then your sequence won't align.
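The example can be checked with a short Python sketch. It assumes the standard genetic code; `translate` and `mismatches` are illustrative helpers written here, not functions from Biostrings or DeconSeq.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (third base varies fastest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate a DNA string in frame 0, ignoring any trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def mismatches(a, b):
    """Count positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


read = "AGAAAGGCCAACCCT"             # sequence present in the WGS data
back_translated = "AGGAAAGCTAATCCG"  # one possible reverse translation

print(translate(read), translate(back_translated))
print(mismatches(read, back_translated), "of", len(read), "bases differ")
```

Both sequences encode the same peptide, yet they differ at the third position of every codon, which is the kind of divergence a nucleotide aligner would penalise.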

Another possible reason for false negative alignments is if you have sequences in your WGS data that come from untranslated regions.

I guess I don't really understand how aligning your data to the nr protein database is going to be an efficient/reliable way to identify non-human contamination. The more straightforward/traditional approach of aligning your data to a reference genome sounds more reliable to me.

H.

ADD REPLY • link 5.6 years ago Hervé Pagès 16k


Thanks for the answer Hervé, especially for giving an example of the reverse translation. A protein database is not the way to go.

From digging around a bit more I found that NCBI's RefSeq database provides a non-redundant collection of sequences representing genomic data, transcripts and proteins.

RefSeq has a genome FTP page from which you can download all of the genomes. In the case of DeconSeq, you can provide both 1) the database(s) to use for removal (i.e. every non-human genome) and 2) the database to retain (i.e. human).

ADD REPLY • link 5.6 years ago moldach ▴ 20