First off, you are using a years-old version of R and Bioconductor (that object you are using is specific to Bioc 3.13, and we are on Bioc 3.16). You should update.
If you mean to convert gene symbols to NCBI Gene IDs, you want SYMBOL, not GENENAME.
> library(AnnotationHub)
> hub <- AnnotationHub()
snapshotDate(): 2022-10-26
> query(hub, c("orgdb","solanum"))
AnnotationHub with 8 records
# snapshotDate(): 2022-10-26
# $dataprovider: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
# $species: Solanum verrucosum, Solanum tuberosum, Solanum stenotomum, Solan...
# $rdataclass: OrgDb
# additional mcols(): taxonomyid, genome, description,
# coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
# rdatapath, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["AH107616"]]'
title
AH107616 | org.Solanum_stenotomum.eg.sqlite
AH107725 | org.Solanum_verrucosum.eg.sqlite
AH107737 | org.Solanum_pennellii_Correll,_1958.eg.sqlite
AH107738 | org.Solanum_pennellii.eg.sqlite
AH107800 | org.Solanum_tuberosum.eg.sqlite
AH107911 | org.Solanum_esculentum.eg.sqlite
AH107912 | org.Solanum_lycopersicum.eg.sqlite
AH107913 | org.Solanum_lycopersicum_var._humboldtii.eg.sqlite
> orgdb <- hub[["AH107912"]]
> head(keys(orgdb, "GENENAME"))
[1] "(+)-menthofuran synthase-like"
[2] "(+)-neomenthol dehydrogenase"
[3] "(+)-neomenthol dehydrogenase-like"
[4] "(-)-camphene/tricyclene synthase, chloroplastic"
[5] "(-)-camphene/tricyclene synthase, chloroplastic-like"
[6] "(-)-germacrene D synthase"
GENENAME is, as the name suggests, the name of the gene. SYMBOL is the official gene symbol.
Thank you! I have updated the versions. I used "GENENAME" because the gene list that I am using has genes such as Solyc05g012020. Could you help me work out what the right key would be here?
That's almost an Ensembl Gene ID.
So if you have the version numbers as well, you should be able to map. Usually the ID without the version suffix (the .3) is the stable ID, but maybe it's different for plants.
Also, you already have the Ensembl IDs, so maybe you can go forward with that?
Thank you! So I can extract entrez ids using this and then run downstream analysis. I also tried passing an object containing a list of genes to getBM, but it returns no output and no error.
Can I use a gene list of Ensembl ids for getBM to get entrez ids? Thank you!
Well, you are probably going to have to do something more brute force if you don't have the version numbers.
And then you can use match() to line up the rows correctly. It can be tricky, so here's an example.
This also illustrates my longstanding contention that you shouldn't map between annotation services unless absolutely necessary. Those NA values represent things that EBI/EMBL say are genes, but for whatever technical reason are not mappable to the corresponding NCBI Gene ID. There may be NCBI Gene IDs that should map, but it's a rules-based process, and if the corresponding NCBI Gene doesn't meet all the criteria, there is no match.
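The brute-force idea (strip the version suffix, then use match() to line up rows) can be sketched in base R with toy data. This is only an illustration under stated assumptions: the `lookup` table stands in for whatever full table you download from the annotation service, its column names are hypothetical, and every ID except Solyc05g012020 is made up.

```r
# Toy lookup table standing in for a full annotation-service download.
# Column names are hypothetical; all IDs except Solyc05g012020 are made up.
lookup <- data.frame(
  versioned_id = c("Solyc05g012020.3", "Solyc01g000010.1", "Solyc02g000020.2"),
  entrez_id    = c("101000001", NA, "101000003"),
  stringsAsFactors = FALSE
)

# Gene list without version numbers, in a different order, including
# one gene that is absent from the lookup table.
my_genes <- c("Solyc02g000020", "Solyc05g012020",
              "Solyc09g999999", "Solyc01g000010")

# Strip a trailing ".<digits>" version suffix so the IDs are comparable.
lookup$stable_id <- sub("\\.[0-9]+$", "", lookup$versioned_id)

# For each element of my_genes, match() gives its row in lookup, or NA.
idx <- match(my_genes, lookup$stable_id)

# Index the lookup by idx so the result is in the same order as my_genes.
result <- data.frame(
  gene   = my_genes,
  entrez = lookup$entrez_id[idx],
  stringsAsFactors = FALSE
)

# Drop genes that have no NCBI Gene ID before any downstream analysis.
mapped <- result[!is.na(result$entrez), ]
```

Note that an NA in `result$entrez` can arise in two ways here: the gene is missing from the lookup table entirely, or it is present but has no NCBI Gene ID.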
Thank you! Now I have an object for entrez ids. I have excluded NAs (the genes with no entrez ids). I also have an object for universe which is basically the list of expressed genes to use for enrichGO.
But I get this error:
It's telling you that none of the GO gene sets contain between 10 and 500 genes. That is plausible, as CC is a relatively barren GO DAG. There are only 8654 genes that are annotated to GO CC, and of those, only 268 are annotated to GO terms with between 10 and 500 genes. In comparison, there are 8347 genes annotated to GO BP terms, and of those, 1201 are in sets of between 10 and 500 genes.
And the GO annotation for tomato is pretty bad. There are 31328 genes in that orgdb, and fewer than a third have GO terms annotated. So it's going to be problematic all the way around. You could lower the minimum gene set size, but that won't fix the overarching issue.
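To make the size filter concrete, here is a small base R sketch of what a 10 to 500 gene set size window does, and why lowering the minimum only helps a little. It is not the enrichGO implementation: the term names, gene IDs, the helper `keep_sets()`, and the inclusive bounds are all assumptions for illustration.

```r
# Toy term-to-gene annotation; term names and gene IDs are made up.
term2gene <- list(
  term_A = paste0("g", 1:3),    # 3 genes: below the minimum
  term_B = paste0("g", 1:12),   # 12 genes: inside the window
  term_C = paste0("g", 1:600),  # 600 genes: above the maximum
  term_D = paste0("g", 5:9)     # 5 genes: below a minimum of 10
)

# Number of genes annotated to each term.
set_sizes <- lengths(term2gene)

# Hypothetical helper: names of terms whose size falls inside the window
# (bounds treated as inclusive, which is an assumption).
keep_sets <- function(sizes, min_size = 10, max_size = 500) {
  names(sizes)[sizes >= min_size & sizes <= max_size]
}

default_kept <- keep_sets(set_sizes)                # only term_B survives
relaxed_kept <- keep_sets(set_sizes, min_size = 5)  # term_D is rescued too

# Fraction of a 2000-gene toy genome that has any annotation at all.
all_genes <- paste0("g", 1:2000)
annotated <- unique(unlist(term2gene))
coverage  <- length(annotated) / length(all_genes)
```

Relaxing the minimum rescues a few small terms, but `coverage` does not change: genes with no annotation stay invisible to the test whatever the window is.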
Blast2GO might have better annotations, but that's locked behind a paywall, which is why we no longer use their data. But maybe you could get a trial subscription.