Dear all,
I am looking for a way to retrieve a protein's coding sequence directly using BioMart and the getSequence function, and only the one coding sequence that translates into the biologically functional protein. For example, when using:
cds_seq = getSequence(id = "NM_004974",
                      type = "refseq_mrna",
                      seqType = "coding",
                      mart = mart)
I get a data frame with 8 different sequences. However, I only want the one that translates into the proper protein. Is there a way to do this?
Dear James,
thanks for your feedback! If I use the proposed R code I get the following result:
How can I determine which one is the one that properly translates into a protein?
Dear James,
thanks for your feedback! If I use the proposed R code I get the following result:
How can I determine which one is the one that properly translates into a protein?
You don't need to post the same thing twice. You should also provide more context. How did you generate your mart object? What is the result of running sessionInfo() after getting your results?
Anyway, you show 7 sequences, all of which have a start codon. Why do you think that only one is 'properly translated'?
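A start codon alone says little, so one way to compare the returned sequences is to test each for a complete open reading frame. The following is a minimal base-R sketch, not something posted in this thread: the helper name is_complete_orf is hypothetical, and it assumes each coding sequence is a plain DNA string that includes its stop codon. Passing the check does not prove that a transcript yields a functional protein; it only flags sequences that are obviously incomplete.

```r
# Hypothetical helper: TRUE when `cds` looks like a complete open reading frame,
# i.e. length is a multiple of 3, it starts with ATG, it ends with a stop codon,
# and no stop codon occurs earlier in frame.
is_complete_orf <- function(cds) {
  cds <- toupper(cds)
  n <- nchar(cds)
  if (is.na(n) || n < 6 || n %% 3 != 0) return(FALSE)

  # Split the sequence into consecutive, non-overlapping codons
  starts <- seq(1, n - 2, by = 3)
  codons <- substring(cds, starts, starts + 2)
  stops  <- c("TAA", "TAG", "TGA")

  codons[1] == "ATG" &&
    codons[length(codons)] %in% stops &&
    !any(head(codons, -1) %in% stops)
}

# Toy sequences for illustration only (not real transcripts)
is_complete_orf("ATGGCCTAA")
is_complete_orf("ATGTAAGCCTAA")
```

Assuming the data frame returned by getSequence has a column named coding, the check could be applied to every row with vapply(cds_seq$coding, is_complete_orf, logical(1)).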
As I have already shown, I only get one sequence, so I have no idea why you would get seven. BUT I also pointed out that there are issues when trying to map transcripts from NCBI to EBI/EMBL. But there does seem to be a 1:1 correspondence, and if I do
I still have no idea why you get multiple transcripts. I just get one, regardless. So it would be useful to first figure out why you get seven and I get one.
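One way to investigate a discrepancy like this is to look at which Ensembl identifiers the RefSeq accession maps to. The sketch below is an illustration, not code from this thread: the helper name map_refseq_to_ensembl is hypothetical, and the mart in the usage comment is an assumption about how the object might be created. If the accession maps to a single transcript whose gene has several coding transcripts, a query that resolves through the gene rather than the transcript would be one possible source of extra sequences.

```r
# Hypothetical helper (not from the thread): list the Ensembl gene and
# transcript IDs that a RefSeq mRNA accession maps to.
# refseq_mrna is used only as a filter; only Ensembl IDs are requested
# as attributes.
map_refseq_to_ensembl <- function(refseq_id, mart) {
  biomaRt::getBM(
    attributes = c("ensembl_gene_id", "ensembl_transcript_id"),
    filters    = "refseq_mrna",
    values     = refseq_id,
    mart       = mart
  )
}

# Usage (needs network access; the mart below is an assumption):
# mart <- biomaRt::useEnsembl(biomart = "ensembl",
#                             dataset = "hsapiens_gene_ensembl")
# map_refseq_to_ensembl("NM_004974", mart)
```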
I think this is because I introduced a bug in the development version of biomaRt while trying to respond to https://github.com/grimbough/biomaRt/issues/33
The Ensembl BioMart will let you query with a filter of refseq_mrna but it won't let you retrieve that ID type as an attribute. I think this was causing the unstable and broken results in the GitHub issue. I tried to work around this by mapping onto Ensembl Gene IDs internally, but I see now that this is obviously not working as I expected when the original identifier is mRNA-based. I'll take another look at the code and try to come up with something more robust.
Thanks, Mike! That is interesting - a colleague has a non-development version of biomaRt installed and he does not get the error. So I guess we're on the right track to solving the problem! Here is the information that was missing in my initial post - sorry about that! I generate my mart object as follows
You aren't using the devel version of biomaRt. Instead you are using an old version that you should update. The current release version of Bioconductor runs on R-4.0.2, and your biomaRt version should be 2.46.0.
Which makes this all the more confusing, as we have three versions of biomaRt in play, that all query the same database, and produce quite different results. I also get 8 sequences with the devel version, but only 1 if I run the query in the Ensembl web interface.
I'll take a longer look at this tomorrow.
I updated both my R version and the biomaRt version as James suggested - and now it seems to give a single result for my query, which looks fine! Still, I am unsure what caused the problem...
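Since different biomaRt versions produced different results in this thread, it can help to record the versions in use next to any query result. The following is a minimal sketch using base R only; the helper name report_versions is hypothetical, and it returns NA for the package when biomaRt is not installed. Running sessionInfo(), as requested earlier in the thread, gives the full picture.

```r
# Hypothetical helper: report the running R version and the installed
# version of a package (biomaRt by default), or NA if it is not installed.
report_versions <- function(pkg = "biomaRt") {
  pkg_version <- if (requireNamespace(pkg, quietly = TRUE)) {
    as.character(utils::packageVersion(pkg))
  } else {
    NA_character_
  }
  c(R = as.character(getRversion()), package = pkg_version)
}

print(report_versions())
```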