Question

Transcript biotypes downloaded with the biomaRt R package are not categorized the same way as on the Ensembl website

llmiao • 0

@llmiao-20991

Hi,

I downloaded all transcripts, annotated with their biotypes, from a human dataset using the biomaRt R package. The transcript biotypes listed include protein coding, processed transcript, retained intron, pseudogene, processed pseudogene, etc. When I looked at the Ensembl website's description of transcript biotypes (https://useast.ensembl.org/info/genome/genebuild/biotypes.html), I found that retained intron is a subcategory of processed transcript, just as processed pseudogene is a subcategory of pseudogene. So it looks as if the biotypes returned by biomaRt are not categorized in the same way as in the Ensembl database. Does anyone know how the biotypes in biomaRt are assigned to each transcript? Is there a reference that gives details on how to interpret the biotypes in BioMart?

Thanks!

Lingling

annotation • 798 views

ADD COMMENT • link 4.9 years ago llmiao • 0

Why do you say the annotation returned by biomaRt is different from what you see in the browser? Can you provide an example?

From reading the biotypes page you link to, my assumption would be that if something is annotated as 'Retained intron' then it is implicitly also annotated as 'Long non-coding RNA (lncRNA)' and 'Processed transcript'.

The results returned by biomaRt are retrieved directly from Ensembl, so the assignment of annotation is already done and should be consistent with results found via other methods of accessing Ensembl data like the browser.
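To make the "implicitly also annotated" idea concrete, here is a minimal sketch in Python. It encodes only the parent/child relationships mentioned in this thread, and it assumes the labels use the underscore spelling (e.g. `retained_intron`). The `PARENT` table and the `lineage` helper are illustrative names, not part of biomaRt or Ensembl.

```python
# Parent of each biotype, limited to the relationships mentioned in this thread.
# Assumption: labels use the underscore spelling (e.g. "retained_intron").
PARENT = {
    "retained_intron": "lncRNA",
    "lncRNA": "processed_transcript",
    "processed_pseudogene": "pseudogene",
}


def lineage(biotype):
    """Return the biotype followed by every broader category above it."""
    chain = [biotype]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


for label in ["retained_intron", "processed_transcript", "processed_pseudogene"]:
    print(label, "->", lineage(label))
```

A label with no entry in the table is treated as a top-level category, so the table can be extended with more rows from the biotypes page as needed.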

ADD REPLY • link 4.9 years ago Mike Smith ★ 6.5k

Thanks for your reply. The difference is not about a specific gene; it is about how the biotypes are categorized. I found that some transcripts are annotated as "processed transcript" in biomaRt, while others are annotated as "retained intron". I assume the ones annotated as "retained intron" are "processed transcript" as well. But for the transcripts annotated as "processed transcript", does this mean they are unclassified processed transcripts that cannot be placed in any of the processed transcript sub-categories? Similarly, some transcripts are annotated as "pseudogene" while others are annotated as "processed pseudogene", which I assume are also "pseudogene". I may have to focus on the transcripts annotated as "processed transcript", but I don't know how to interpret them, given that I also have others annotated as "retained intron", which is clearer.
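One way to keep the two cases apart while working with the table is to record, for each transcript, both its broad category and whether its label is the broad category itself or a sub-category. The Python sketch below does this under stated assumptions: the transcript IDs are made up, the `BROAD` table covers only the two sub-categories named above, and whether a bare "processed transcript" label really means "unclassified" is left open.

```python
from collections import defaultdict

# Hypothetical (transcript ID, biotype) pairs standing in for a downloaded table.
rows = [
    ("TX_0001", "processed_transcript"),
    ("TX_0002", "retained_intron"),
    ("TX_0003", "pseudogene"),
    ("TX_0004", "processed_pseudogene"),
    ("TX_0005", "protein_coding"),
]

# Broad category for the sub-categories discussed above (partial, assumed).
BROAD = {
    "retained_intron": "processed_transcript",
    "processed_pseudogene": "pseudogene",
}


def split_by_specificity(rows):
    """Group IDs under (broad category, 'generic' or 'sub-category')."""
    groups = defaultdict(list)
    for tx_id, biotype in rows:
        if biotype in BROAD:
            groups[(BROAD[biotype], "sub-category")].append(tx_id)
        else:
            groups[(biotype, "generic")].append(tx_id)
    return dict(groups)


groups = split_by_specificity(rows)
for key in sorted(groups):
    print(key, groups[key])
```

The `("processed_transcript", "generic")` group then holds exactly the transcripts whose interpretation is in question, while the sub-category groups stay available for comparison.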

ADD REPLY • link 4.9 years ago llmiao • 0