Question: Transcript biotypes downloaded with the biomaRt R package are not categorized the same way as on the Ensembl website
I downloaded all transcripts and their biotype annotation for a human dataset through the biomaRt R package. The transcript biotypes listed include protein coding, processed transcript, retained intron, pseudogene, processed pseudogene, etc. When I looked at the Ensembl website's pages on transcript biotypes, I found that retained intron is a subcategory of processed transcript, just as processed pseudogene is a subcategory of pseudogene. So it looks like the biotypes listed by biomaRt are not categorized the same way as in the Ensembl database. Does anyone know how the biotypes in biomaRt are assigned to each transcript? Is there a reference that gives details on, or an interpretation of, the biotypes in biomaRt?



Why do you say the annotation returned by biomaRt is different from what you see in the browser? Can you provide an example?

From reading the biotypes pages you link to, my assumption would be that if something is annotated as 'Retained intron' then it is implicitly also annotated as 'Long non-coding RNA (lncRNA)' and 'Processed transcript'.
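
To make that assumption concrete, here is a minimal sketch of how the more specific labels could be collapsed into their parent categories. The parent table contains only the two relationships mentioned in this thread (it is not an official Ensembl table), the underscore spelling of the labels is an assumption about how they appear in the downloaded results, and the transcript IDs are made up for illustration.

```python
from collections import Counter

# Assumed parent categories, limited to the two relationships discussed
# in this thread. This is NOT an official or complete Ensembl hierarchy.
PARENT_BIOTYPE = {
    "retained_intron": "processed_transcript",
    "processed_pseudogene": "pseudogene",
}


def top_level_biotype(biotype):
    """Follow parent links until a biotype with no listed parent is reached."""
    seen = set()
    while biotype in PARENT_BIOTYPE and biotype not in seen:
        seen.add(biotype)  # guard against accidental cycles in the table
        biotype = PARENT_BIOTYPE[biotype]
    return biotype


# Hypothetical transcript IDs with the most specific label each one carries.
transcripts = {
    "tx1": "protein_coding",
    "tx2": "processed_transcript",
    "tx3": "retained_intron",
    "tx4": "pseudogene",
    "tx5": "processed_pseudogene",
}

# Collapse every label to its top-level category and count the categories.
collapsed = {tx: top_level_biotype(bt) for tx, bt in transcripts.items()}
counts = Counter(collapsed.values())
```

Under this reading, each transcript carries only its most specific label, and the broader categories are recovered by walking up the hierarchy rather than being reported as separate annotations.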

The results returned by biomaRt are retrieved directly from Ensembl, so the assignment of annotation is already done and should be consistent with results found via other methods of accessing Ensembl data like the browser.

Thanks for your reply. The difference is not about a specific gene; it is about how the biotypes are categorized. I found that some transcripts are annotated as "processed transcript" in biomaRt, while others are annotated as "retained intron". I assume the ones annotated as "retained intron" are also "processed transcript". But for the transcripts annotated only as "processed transcript", does this mean they are unclassified processed transcripts that cannot be placed in any of the other processed transcript sub-categories? Similarly, some transcripts are annotated as "pseudogene" while others are annotated as "processed pseudogene", which I assume are also "pseudogene". I may have to focus on the transcripts annotated as "processed transcript", but I don't know how to interpret them, given that I have others annotated as "retained intron", which is clearer.

