Hi Valerie and James,
Thank you both for your helpful answers. I think I need to give a little more background on what I am trying to achieve, to make the question easier to understand. We are studying short polypeptides derived from proteins expressed by a large number of viruses. We are building large systematic assay systems in which we express these polypeptides (approx. 45 aa long) in mammalian cells using viral vectors. We build the libraries using custom microarrays, which can generate 100 000 oligonucleotides (200 bp long) that are then put into the viral vector expression system.
The challenge is this: while 100 000 sounds like a lot, it is actually not that many considering the number of viral strains and proteins we wish to express. Therefore, we need to make sure that we do not have unnecessary redundancy in the library, i.e., two genetic sequences that translate into identical polypeptides. Unfortunately, many viral strains have high genetic diversity while coding for highly conserved proteins. Thus, if we were only to fragment the DNA into pieces of suitable length and remove identical DNA duplicates, we would end up with many more than 100 000 gene sequences, and identical polypeptides would be expressed at a higher abundance than those that are actually different. In addition, some of the viruses are not mammalian viruses, so there is no guarantee that their DNA sequences would translate efficiently into proteins in mammalian cells.
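To make the redundancy problem concrete, here is a minimal Python sketch (not our actual pipeline) that collapses DNA fragments by the polypeptide they encode rather than by their DNA sequence. The function names and the toy fragments are made up for illustration, and the sketch assumes the standard genetic code and fragments that are already in frame.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate an in-frame DNA string; a trailing partial codon is ignored."""
    dna = dna.upper()
    return "".join(CODON_TO_AA[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def unique_peptides(fragments):
    """Map each distinct peptide to the first DNA fragment that encodes it."""
    seen = {}
    for dna in fragments:
        seen.setdefault(translate(dna), dna)
    return seen


# Toy example: the first two fragments differ in DNA but encode the same peptide.
fragments = ["ATGGCCAAG", "ATGGCTAAA", "ATGGCCAAC"]
library = unique_peptides(fragments)
```

Here three distinct DNA fragments collapse to two library entries, because the first two differ only at synonymous positions.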
So the situation is not at all that I need to figure out the original DNA sequence from an AA sequence (I realise that this would be impossible). Instead, what we need to generate are cDNA sequences that would translate with sufficient efficiency into the target polypeptides in mammalian cells.
For this, I can see one possible process myself. The first step would be, as James suggests, to back-translate each amino acid into a single codon based on the human codon frequency table. I would then run a mammalian codon optimisation on the entire generated sequence, similar to what GenScript and other gene synthesis companies offer.
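The first step could look roughly like the Python sketch below. The codon table is an assumption on my part: it lists one commonly preferred human codon per amino acid for illustration only and should be replaced with values taken from a real human codon usage table. The name `back_translate` is likewise just a placeholder.

```python
# ASSUMPTION: illustrative "one preferred codon per amino acid" table for human.
# Replace these entries with the top codon from an actual codon usage table.
PREFERRED_CODON = {
    "A": "GCC", "R": "AGA", "N": "AAC", "D": "GAC", "C": "TGC",
    "Q": "CAG", "E": "GAG", "G": "GGC", "H": "CAC", "I": "ATC",
    "L": "CTG", "K": "AAG", "M": "ATG", "F": "TTC", "P": "CCC",
    "S": "AGC", "T": "ACC", "W": "TGG", "Y": "TAC", "V": "GTG",
}


def back_translate(peptide):
    """Return a DNA string that encodes `peptide`, one fixed codon per residue."""
    try:
        return "".join(PREFERRED_CODON[aa] for aa in peptide.upper())
    except KeyError as err:
        # Ambiguous or non-standard residues (e.g. X, B, Z) are rejected.
        raise ValueError(f"unsupported residue: {err.args[0]}") from None


dna = back_translate("MAK")
```

A 45 aa polypeptide gives 135 nt this way, which fits within a 200 bp oligonucleotide. Because the same codon is always chosen for a given residue, this step by itself does no optimisation across the whole sequence.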
It is a function for this second, optimisation step that I was looking for, as I have very little knowledge of codon optimisation principles. The first part of the conversion I can clearly write myself.
I hope that this made my question a little clearer.
Thank you again!