I am developing an RNA-seq pipeline, and one of its steps is to annotate my genes. I am using Ensembl IDs and would like my query to return "ensemblgeneid", "hgncid", "hgncsymbol" and "description". I need this information to perform gene set enrichment analysis later.
I have been having problems with biomaRt being slow and timing out, as described in this post: https://support.bioconductor.org/p/122412/#122533
So I am thinking of switching to a different annotation package. There appear to be many out there: https://www.bioconductor.org/packages/release/data/annotation/
I need a package that has GRCh38 (b38) data for humans and provides the "ensemblgeneid", "hgncid", "hgncsymbol" and "description" information. I have found the following:
Package             Maintainer       Title
EnsDb.Hsapiens.v75  Johannes Rainer  Ensembl based annotation package
EnsDb.Hsapiens.v79  Johannes Rainer  Ensembl based annotation package
EnsDb.Hsapiens.v86  Johannes Rainer  Ensembl based annotation package
However, I am concerned that these packages may be outdated. Would anyone recommend a particular up-to-date package to replace biomaRt? Has anyone had a similar experience?
Thanks.
biomaRt is supreme for annotation. If you have issues with your connection, you could download an entire table via biomaRt as a one-off, then save and date-stamp it. You can then reuse that table for annotating. This may actually be better, because you then have version control in place.
For example, with this code, you can obtain a table that links Affy U133 probe IDs to ENSG ID, Biotype, and official gene symbols:
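A minimal sketch of such a one-off download, assuming the U133 Plus 2.0 array (attribute "affy_hg_u133_plus_2") and the current Ensembl human dataset. The helper names stamped_file(), fetch_annotation() and load_or_fetch() are hypothetical and only there to illustrate the save-and-reuse idea:

```r
# Build a date-stamped file name such as "u133_annotation_2019-07-01.rds".
stamped_file <- function(prefix, date = Sys.Date(), dir = ".") {
  file.path(dir, sprintf("%s_%s.rds", prefix, format(date, "%Y-%m-%d")))
}

# Download the whole mapping table in one go.
# Needs the biomaRt package and a working connection to Ensembl.
fetch_annotation <- function(attributes = c("affy_hg_u133_plus_2",
                                            "ensembl_gene_id",
                                            "gene_biotype",
                                            "external_gene_name")) {
  stopifnot(requireNamespace("biomaRt", quietly = TRUE))
  mart <- biomaRt::useMart("ensembl", dataset = "hsapiens_gene_ensembl")
  # No 'filters'/'values' are given, so the full table comes back in one query.
  biomaRt::getBM(attributes = attributes, mart = mart)
}

# Reuse the saved copy if it exists; otherwise download once and save it.
load_or_fetch <- function(path, fetch = fetch_annotation) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  tab <- fetch()
  saveRDS(tab, path)
  tab
}
```

For the original question, the same approach should work with attributes = c("ensembl_gene_id", "hgnc_id", "hgnc_symbol", "description"). Pass load_or_fetch() a fixed, dated path, e.g. stamped_file("annotation", as.Date("2019-07-01")), so that later runs read the saved table instead of querying Ensembl again.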
BiocFileCache might help implement this strategy, e.g., the use case 2.1 Local cache of an internet resource.
The caching implemented in biomaRt devel makes use of BiocFileCache to store results tables. It's really useful.
Thanks, I ended up using your solution, and indeed it is much, much faster not to run a specific query. Thanks!
That's weird, somehow this works now. Before, I was querying a specific subset with the getBM() function, but when I didn't query a specific subset, as above, it somehow worked.
But I am no longer getting the usual message: Batch submitting query [=======>-----------------------------------------------------]
Is that normal? I'm confused as to why this suddenly works..
You don't get the "batch submitting" message because the batches are defined by the values you provide. If you don't supply the values argument, the query can't be split into batches.
Ensembl recommend a maximum of 500 values for each filter; otherwise queries can sometimes silently time out on the server side, and you get a truncated result table with no indication that this happened. There are examples of this if you search this forum, which is why the batch submission was introduced in the first place.
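biomaRt handles this batching internally, so you should not need to do it yourself. Purely to illustrate the idea, here is a rough sketch in base R; split_into_batches() and query_in_batches() are hypothetical helpers, not part of biomaRt:

```r
# Split a vector of IDs into chunks of at most 'batch_size' values.
split_into_batches <- function(values, batch_size = 500) {
  values <- unique(values)
  unname(split(values, ceiling(seq_along(values) / batch_size)))
}

# Run 'query_fun' once per chunk and stack the resulting data frames.
query_in_batches <- function(values, query_fun, batch_size = 500) {
  batches <- split_into_batches(values, batch_size)
  results <- lapply(batches, query_fun)
  do.call(rbind, results)
}

# 'query_fun' could wrap getBM(), assuming 'mart' was created beforehand:
# query_fun <- function(ids) {
#   biomaRt::getBM(attributes = c("ensembl_gene_id", "hgnc_symbol"),
#                  filters = "ensembl_gene_id", values = ids, mart = mart)
# }
```

Each request then stays within the recommended 500 values, and a failed batch can be retried on its own rather than rerunning the whole query.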
You can give the newest version of biomaRt a try, which will cache results for next time you run them, and store the temporary results for resuming a query that fails. You can install with
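Something along these lines should do it, assuming the development version is hosted in the maintainer's GitHub repository (grimbough/biomaRt, which is my assumption here) and that installing from GitHub via BiocManager works on your setup:

```r
# Install BiocManager first if it is not already available.
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

# Assumed repository location; BiocManager passes "user/repo" on to a GitHub install.
BiocManager::install("grimbough/biomaRt")
```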
Bear in mind that these features are very new, so if it doesn't work properly please let me know here or even better raise an issue on GitHub.
You are right, the EnsDb databases that I provide as packages are outdated. But you can get EnsDb databases from AnnotationHub for all Ensembl releases from version 87 on.

So, there is the "gene_id" field (Ensembl gene ID), "symbol" (HGNC symbol), "description" and "entrezid" - I don't provide the HGNC ID directly, but you could extract that from the description.

You can also query AnnotationHub to get an overview of all EnsDb databases that are available.
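A minimal sketch of the AnnotationHub route, assuming the AnnotationHub and ensembldb packages are installed and that descriptions follow the usual Ensembl pattern "... [Source:HGNC Symbol;Acc:HGNC:11998]". The helper names extract_hgnc_id(), list_ensdbs() and annotate_from_ensdb() are hypothetical:

```r
# Pull the HGNC accession out of Ensembl-style descriptions.
# Entries without an "HGNC:<digits>" tag (or NA) give NA.
extract_hgnc_id <- function(description) {
  hit <- regexpr("HGNC:[0-9]+", description)
  found <- !is.na(hit) & hit > 0
  out <- rep(NA_character_, length(description))
  out[found] <- regmatches(description, hit)
  out
}

# List all EnsDb records known to AnnotationHub (needs network access).
list_ensdbs <- function() {
  stopifnot(requireNamespace("AnnotationHub", quietly = TRUE))
  AnnotationHub::query(AnnotationHub::AnnotationHub(), "EnsDb")
}

# Fetch one EnsDb and return a gene table with an added 'hgnc_id' column.
annotate_from_ensdb <- function(release, species = "Homo sapiens") {
  stopifnot(requireNamespace("AnnotationHub", quietly = TRUE),
            requireNamespace("ensembldb", quietly = TRUE))
  ah <- AnnotationHub::AnnotationHub()
  # The release number is matched as plain text, so check 'hits' by eye
  # if more than one record comes back.
  hits <- AnnotationHub::query(ah, c("EnsDb", species, as.character(release)))
  if (length(hits) == 0) {
    stop("no EnsDb found for release ", release)
  }
  edb <- hits[[1]]  # downloads (and caches) the first matching database
  tab <- ensembldb::genes(
    edb,
    columns = c("gene_id", "symbol", "description", "entrezid"),
    return.type = "data.frame"
  )
  tab$hgnc_id <- extract_hgnc_id(tab$description)
  tab
}
```

AnnotationHub caches the downloaded database locally, so after the first call to something like annotate_from_ensdb(97), later runs should not depend on the connection in the way live biomaRt queries do.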