First, an OrgDb isn't tied to a particular genome build, because none of the information in that package is intended to be tied to a genomic position. Those objects do (still) contain genomic positions, but they are meant to be removed; that just hasn't happened yet.
Second, the name of the OrgDb is intended to tell you the provenance of the data. So org.Hs.eg.db tells you that it's an OrgDb for Homo sapiens, based on Entrez Gene (what NCBI Gene IDs used to be called). And the last part is what matters here. The central ID for these packages is the NCBI Gene ID, and all mappings are based on those IDs. So if you ask for the HUGO symbol for an Ensembl ID, the Ensembl ID is first mapped to its corresponding NCBI Gene ID, and then that Gene ID is mapped to the correct HUGO symbol. If the Ensembl ID has no corresponding NCBI Gene ID, there is nothing to map through and you get no symbol.
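To make the two-step lookup concrete, here is a rough sketch in base R. The two tables are tiny hand-built stand-ins, not the package's actual internals, and ensembl_to_symbol() is a hypothetical helper made up for illustration. The only point is that the symbol is reached through the NCBI Gene ID, so an Ensembl ID without one comes back empty.

```r
# Hand-built toy tables for illustration only; these are NOT the real
# tables inside org.Hs.eg.db, just two small lookups keyed on NCBI Gene ID.
ensembl_to_entrez <- data.frame(
  ENSEMBL  = c("ENSG00000141510", "ENSG00000012048"),
  ENTREZID = c("7157", "672"),
  stringsAsFactors = FALSE
)
entrez_to_symbol <- data.frame(
  ENTREZID = c("7157", "672"),
  SYMBOL   = c("TP53", "BRCA1"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: Ensembl ID -> NCBI Gene ID -> symbol.
ensembl_to_symbol <- function(ids) {
  # Step 1: Ensembl ID to the central NCBI Gene ID (NA when there is none)
  entrez <- ensembl_to_entrez$ENTREZID[match(ids, ensembl_to_entrez$ENSEMBL)]
  # Step 2: NCBI Gene ID to symbol (an NA Gene ID stays NA)
  symbol <- entrez_to_symbol$SYMBOL[match(entrez, entrez_to_symbol$ENTREZID)]
  data.frame(ENSEMBL = ids, ENTREZID = entrez, SYMBOL = symbol,
             stringsAsFactors = FALSE)
}

res <- ensembl_to_symbol(c("ENSG00000141510", "ENSG00000217120"))
print(res)
```

Note that this toy version returns NA for an unmapped ID, whereas select() on the real package throws an error when none of the keys are valid.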
First, try the OrgDb:
> library(org.Hs.eg.db)
Loading required package: AnnotationDbi
> select(org.Hs.eg.db, "ENSG00000217120", c("SYMBOL","ENTREZID"), "ENSEMBL")
Error in .testForValidKeys(x, keys, keytype, fks) :
None of the keys entered are valid keys for 'ENSEMBL'. Please use the keys method to see a listing of valid arguments.
As you already know, there are no mappings there. Let's try biomaRt:
> library(biomaRt)
> mart <- useEnsembl("ensembl","hsapiens_gene_ensembl", mirror = "useast")
> getBM(c("hgnc_symbol","entrezgene_id","ensembl_gene_id"), "ensembl_gene_id", "ENSG00000217120", mart)
  hgnc_symbol entrezgene_id ensembl_gene_id
1          NA            NA ENSG00000217120
So you can see there is no NCBI Gene ID that corresponds to this Ensembl ID. But what about the HUGO symbol?
If you go to Ensembl, it appears there is a gene symbol (either Z98755.1 or CLPX), but if you thought that, you would be wrong. There are 'gene symbols' that people make up, but they aren't real gene symbols! Real gene symbols come from HUGO, and according to that resource, there isn't a symbol for this pseudogene. Probably because of the 'pseudo' part, I would imagine.
Great, thank you so much, this makes a lot of sense. So does the name "pseudogene" have something to do with it having no "real symbol"?
Of course. A pseudogene is just a section of the genome that resembles a real gene. Hence the pseudo part. Why bother giving a gene symbol to a thing that isn't really thought of as being a real gene?
Great thanks!