How to reverse translate amino acid sequence into mammalian codon optimised cDNA
@tomasbjorklund-7071
Sweden

I have an issue that has been surprisingly difficult to find an answer to online, even though it is very straightforward.

I have a long list of amino acid sequences (one-letter code). I need to convert them into cDNA sequences that are codon optimised for mammalian expression. Can anyone help me find a suitable function for this? (Only the reverse translation is required.)

Thanks

/Tomas

convert translation • 4.7k views

I'm not sure that you are correct that 'it is very straightforward', which is likely why you can't find functionality to do this. Given that each amino acid has a one-to-many association with the codons that encode it, how do you propose doing the reverse mapping?

As an example, let's consider a simple 3 amino acid sequence, FLP. If I go to someplace like (http://www.genscript.com/cgi-bin/tools/codon_freq_table) and get the human codon frequency table, I get

 TTT F 0.45  TTC F 0.55
 TTA L 0.07  TTG L 0.13  CTT L 0.13  CTC L 0.2   CTA L 0.07  CTG L 0.41
 CCT P 0.28  CCC P 0.33  CCA P 0.27  CCG P 0.11

So unless you are going to make an (unwarranted, IMO) simplifying assumption that you can just use the most common codon for each amino acid, you have 48 different possible sequences (2 codons for F, 6 for L, 4 for P) that could have given rise to a simple 3 amino acid sequence. Obviously, as the amino acid sequence gets longer, the number of possible cDNA sequences that could give rise to it blows up massively.
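
To make the count concrete, here is a minimal base-R sketch that multiplies the number of synonymous codons at each position. The codon counts come from the table above, which only covers F, L and P; the names `codons_per_aa` and `count_encodings` are made up for illustration.

```r
# Number of synonymous codons per amino acid, read off the table above
# (only F, L and P are listed there).
codons_per_aa <- c(F = 2, L = 6, P = 4)

# Count the cDNA sequences that encode a peptide: the product of the
# number of synonymous codons at each position.
count_encodings <- function(peptide, n_codons = codons_per_aa) {
  residues <- strsplit(peptide, "")[[1]]
  stopifnot(all(residues %in% names(n_codons)))
  prod(n_codons[residues])
}

count_encodings("FLP")             # 2 * 6 * 4 = 48
count_encodings(strrep("L", 45))   # 6^45, roughly 1e35
```

The second call shows how quickly the count grows for a peptide of a few dozen residues.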

It seems that for any reasonably long amino acid sequence you would then either get some massive number of possible cDNA sequences (not likely useful), or a single sequence that has a probability somewhere around 1/<some massive number> of being the right one. Neither outcome seems very useful to me.

@valerie-obenchain-4275
United States

Hi Tomas,

We don't have a reverse translate function in Bioconductor, at least not one that's exported. It's possible Herve wrote a similar helper at one point. If that's the case I'm sure he'll post.

I don't think we've had a request for this function before. I'm interested in hearing if others have this same need ... ? Are you looking for a consensus sequence derived from all possible codons, or only sequences from non-degenerate codons, i.e., something similar to this tool?

http://www.bioinformatics.org/sms2/rev_trans.html

FYI, the low level objects in Biostrings used in the forward translation might be of interest if you want to experiment with a reverse prototype.

library(Biostrings)

?IUPAC_CODE_MAP

?GENETIC_CODE
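
One possible starting point for such a prototype is to invert a codon-to-amino-acid lookup. The sketch below builds a base-R stand-in with the same shape as `GENETIC_CODE` (a named character vector with codons as names and one-letter amino acids as values), so it runs without Biostrings. With the package loaded, `GENETIC_CODE` itself could be passed to `split()` in the same way. The names `genetic_code` and `aa_to_codons` are illustrative only.

```r
# Stand-in for Biostrings' GENETIC_CODE: the standard genetic code as a
# named character vector (codon -> amino acid, "*" marks stop codons).
bases <- c("T", "C", "A", "G")
grid <- expand.grid(b3 = bases, b2 = bases, b1 = bases,
                    stringsAsFactors = FALSE)
codons <- paste0(grid$b1, grid$b2, grid$b3)   # TTT, TTC, TTA, TTG, TCT, ...
aas <- "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
genetic_code <- setNames(strsplit(aas, "")[[1]], codons)

# Invert the mapping: one entry per amino acid, holding all its codons.
aa_to_codons <- split(names(genetic_code), genetic_code)

aa_to_codons[["F"]]                       # "TTT" "TTC"
lengths(aa_to_codons)[c("F", "L", "P")]   # 2, 6 and 4 codons
```

This only lists the candidate codons. Choosing among them still needs a rule, such as a codon frequency table for the target organism.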

Valerie

@tomasbjorklund-7071
Sweden

Hi Valerie and James,

Thank you both for your helpful answers. I think I need to give a little more background on what I am trying to achieve. We are studying short polypeptides derived from proteins expressed by a large number of viruses. We are building large systematic assay systems where we express these polypeptides (approx. 45 aa long) using viral vectors in mammalian cells. We build the libraries using custom microarrays, which can generate 100 000 oligonucleotides (200 bp long) that are then put into the viral vector expression system.

The challenge is this: while 100 000 sounds like a lot, it is actually not that many considering the number of viral strains and proteins we wish to express. Therefore, we need to make sure that we do not have unnecessary redundancy in the library, i.e., two genetic sequences that translate into identical polypeptides. Unfortunately, many viral strains have high genetic diversity while coding for highly conserved proteins. Thus, if we were only to fragment the DNA into pieces of suitable length and remove identical duplicates, we would have many more than 100 000 gene sequences, and identical polypeptides would be expressed at a higher abundance than those that are actually different. In addition, some of the viruses are not mammalian viruses, so there is no guarantee that these DNA sequences would translate efficiently into proteins in mammalian cells.

So the situation is not at all that I need to figure out the original DNA sequence from an AA sequence (I realise that this would be impossible). Instead, what we need to generate are cDNA sequences that would translate with sufficient efficiency into the target polypeptides in mammalian cells.
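
The redundancy problem can be sketched at the peptide level: cut each protein into fixed-width peptides and keep only the distinct ones. This is a rough base-R illustration, not our actual pipeline. The 45 aa width comes from the description above, while the step size, the function name `tile_peptides` and both example sequences are made up.

```r
# Cut a protein into overlapping peptides of fixed width.
# `width` follows the ~45 aa mentioned above; `step` is an arbitrary choice.
tile_peptides <- function(protein, width = 45, step = 15) {
  n <- nchar(protein)
  if (n < width) return(character(0))
  starts <- seq(1, n - width + 1, by = step)
  substring(protein, starts, starts + width - 1)
}

# Two hypothetical strain variants that differ only at the first residue.
strain_a <- paste(rep("MKTAYIAKQR", 6), collapse = "")   # 60 aa
strain_b <- sub("^M", "V", strain_a)

peptides <- unlist(lapply(c(strain_a, strain_b), tile_peptides))
library_peptides <- unique(peptides)   # drop identical polypeptides

length(peptides)           # 4 peptides before deduplication
length(library_peptides)   # 3 distinct peptides remain
```

Each 45 aa peptide needs 135 nt of coding sequence, which leaves room within a 200 bp oligonucleotide.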

For this, I can see one possible process myself: the first step would be, as James suggests, to translate each amino acid into one codon based on the human codon frequency table. After that, I would run a mammalian codon optimisation on the entire generated sequence, similar to what Genscript and other gene synthesis companies offer.

It is a function like this that I was looking for, as I have very little knowledge of codon optimisation principles. The first part of the conversion I can clearly write myself.
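
The first step could look roughly like the base-R sketch below, which picks the most frequent codon for each amino acid. The frequencies are the human values for F, L and P quoted earlier in the thread; a real run would need the full table for all 20 amino acids. The names `codon_freq`, `best_codon` and `reverse_translate` are made up for illustration.

```r
# Human codon frequencies for F, L and P only, as quoted earlier in the
# thread; the remaining amino acids would have to be added.
codon_freq <- data.frame(
  codon = c("TTT", "TTC", "TTA", "TTG", "CTT", "CTC",
            "CTA", "CTG", "CCT", "CCC", "CCA", "CCG"),
  aa    = c("F", "F", "L", "L", "L", "L",
            "L", "L", "P", "P", "P", "P"),
  freq  = c(0.45, 0.55, 0.07, 0.13, 0.13, 0.20,
            0.07, 0.41, 0.28, 0.33, 0.27, 0.11),
  stringsAsFactors = FALSE
)

# Pick the highest-frequency codon for each amino acid.
best_codon <- vapply(
  split(codon_freq, codon_freq$aa),
  function(d) d$codon[which.max(d$freq)],
  character(1)
)

# Reverse translate a one-letter peptide using one codon per amino acid.
reverse_translate <- function(peptide, codon_map = best_codon) {
  residues <- strsplit(peptide, "")[[1]]
  unknown <- setdiff(residues, names(codon_map))
  if (length(unknown) > 0) {
    stop("no codon available for: ", paste(unknown, collapse = ", "))
  }
  paste(codon_map[residues], collapse = "")
}

reverse_translate("FLP")   # "TTCCTGCCC"
```

This covers only the one-codon-per-amino-acid step; the whole-sequence optimisation pass would still have to follow.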

I hope that this made my question a little clearer.

Thank you again!

/Tomas

@tomasbjorklund-7071
Sweden

Hi again,

It seems that the second part, the codon optimisation, could perhaps be achieved with GeneGA in Bioconductor. Is anyone familiar with it, and would you recommend it?

caroline • 0
@caroline-7721

Mammalian cell expression systems are the best choice for the production of eukaryotic proteins, especially when correct folding and post-translational modifications (glycosylation, phosphorylation, etc.) are required. They produce eukaryotic recombinant proteins in the most natural state, with native tertiary structure, physicochemical characteristics, and bioactivities. They have been successfully applied in the biopharmaceutical production of cytokines, monoclonal antibodies, growth factors and so on. The most widely used mammalian cell lines are HEK293 and CHO cells.