OrgDb (org.Hs.eg.db) missing many EntrezIDs and their Symbols (Compared to Ensembl GRCh38 v103)
Nathan • 0
@b385bbfe
Last seen 3 months ago
United States

I'm using ArchR to analyze an H. sapiens PBMC scATAC dataset and decided to use Ensembl's GRCh38, Release 103 genome as my reference.

To do this I needed ArchR's createGenomeAnnotation and createGeneAnnotation to define the genome. createGeneAnnotation needs an OrgDb object, which I obtained through AnnotationHub.

Once I loaded this OrgDb I realized that some GENEIDs (EntrezIDs) present in my GTF for GRCh38, Release 103 were missing from it. I thought this meant it wasn't up to date, but only one record was returned when I queried AnnotationHub for it with query(hub, c("Homo sapiens","OrgDb")).

Is there somewhere I can get a more up-to-date version of the H. sapiens OrgDb? I was under the impression it was regularly updated, so I had trouble believing that it was missing so many IDs.

It's resulting in detected marker genes being labeled as NA_<GeneIDPresentInGTF>.

Any and all guidance would be greatly appreciated.

AnnotationHub OrganismDbi org.Hs.eg.db • 432 views

Cross-posted to Biostars https://www.biostars.org/p/9513406/

@gordon-smyth
Last seen 6 hours ago
WEHI, Melbourne, Australia

The organism package org.Hs.eg.db is updated twice a year, in March and September, a week or two before each Bioconductor release. The current version of org.Hs.eg.db is dated 15 September 2021.

org.Hs.eg.db is 100% comprehensive in that it contains all Entrez IDs that exist at the time it is created. However, new Entrez IDs are regularly created.

You can get the definitive list of Entrez IDs at any time by downloading the Homo_sapiens.gene_info.gz file from the NCBI FTP site. Going back through my records I see that there were 61,760 Entrez IDs on 26 January 2021:

> NCBI.210126 <- read.delim("210126-Homo_sapiens.gene_info.gz")
> dim(NCBI.210126)
[1] 61760    16


and 64,503 Entrez IDs on 4 November 2021:

> NCBI.211104 <- read.delim("211104-Homo_sapiens.gene_info.gz")
> dim(NCBI.211104)
[1] 64503    16


The org.Hs.eg.db package contains 63,901 Entrez IDs:

> library(org.Hs.eg.db)
> length(Lkeys(org.Hs.egSYMBOL))
[1] 63901


which seems comprehensive for 15 September 2021.
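To check a particular set of IDs against the package, a set difference is enough. The following is a minimal sketch, not part of the original answer: the helper name ids_not_in and the example vector my_ids are hypothetical, and the lookup only runs if org.Hs.eg.db is installed.

```r
# IDs in `query` that are absent from `reference` (both compared as character).
ids_not_in <- function(query, reference) {
  setdiff(unique(as.character(query)), as.character(reference))
}

# Illustrative input: two real Entrez IDs (TP53, BRCA1) and one string
# that is not an Entrez ID at all.
my_ids <- c("7157", "672", "not_an_entrez_id")

if (requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
  # All Entrez IDs known to the installed version of the package.
  known <- AnnotationDbi::keys(org.Hs.eg.db::org.Hs.eg.db, keytype = "ENTREZID")
  print(ids_not_in(my_ids, known))
}
```

Anything reported here is either newer than the package snapshot or not an Entrez ID in the first place, which is worth ruling out before looking for a newer OrgDb.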


Would you say it's best to use a set of annotations containing Entrez IDs along with any OrgDb object? I was using a set of annotations with Ensembl IDs and got many cases of genes being marked as NA_{EnsemblID}, as shown above. I thought the point of OrgDb objects was to consolidate the different IDs given to features across institutions and databases, so it struck me as strange that many genes detected as marker genes didn't have a row in org.Hs.eg.db. While I think this was partially due to OrgDb's periodic updates, a large proportion of marker genes (30% in some clusters) followed the pattern NA_{EnsemblID}.

Do you think this was because my Ensembl GTF contains genes that weren't associated with EntrezIDs at the time of the OrgDb release? OrgDb objects have comprehensive coverage of EntrezIDs but not of other sets like Ensembl, correct?


The NCBI (Entrez ID) and Ensembl gene annotations are basically incompatible. Both databases contain virtually all well-recognized genes, but the two annotations do not match up perfectly, meaning that you cannot map IDs from one to the other reliably. If you try to map Ensembl IDs to Entrez you will generally end up with a high proportion of missing values. This remains true even if the same genes are in both databases, simply because IDs in the two databases don't line up in such a way that an ID in one can be mapped exactly and uniquely to an ID in the other database.
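One way to see how many missing values such a mapping produces on your own data is to measure it directly. This is a rough sketch under stated assumptions: fraction_unmapped and ens_ids are hypothetical names, the second ID is deliberately made up so that it cannot map, and the mapIds call only runs if org.Hs.eg.db is installed.

```r
# Fraction of keys for which a mapping returned NA.
fraction_unmapped <- function(mapped) mean(is.na(mapped))

# Illustrative input: TP53's Ensembl gene ID plus a made-up ID.
# In practice this would be the gene_id column of your GTF.
ens_ids <- c("ENSG00000141510", "ENSG99999999999")

if (requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
  entrez <- AnnotationDbi::mapIds(
    org.Hs.eg.db::org.Hs.eg.db,
    keys      = ens_ids,
    keytype   = "ENSEMBL",
    column    = "ENTREZID",
    multiVals = "first"  # keep one Entrez ID when an Ensembl ID maps to several
  )
  print(entrez)
  cat("Unmapped fraction:", fraction_unmapped(entrez), "\n")
}
```

Note that multiVals = "first" silently hides one-to-many mappings, which is another symptom of the two ID systems not lining up one-to-one.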

I am not an author of an organism package and I can't speak for the intentions of the author, but the purpose of org.Hs.eg.db is certainly not to consolidate different IDs across databases because that is impossible. org.Hs.eg.db is very clearly an NCBI-centric annotation package even though it gives the Ensembl IDs that it can.

In my opinion, you need to use either Ensembl or NCBI gene annotation, but not both. It is not realistic to expect to use the two gene ID systems interchangeably. If you are using an Ensembl or Gencode GTF, then the Bioconductor organism packages like org.Hs.eg.db will not be useful to you; at least, I would not use them. You instead need to use Ensembl-centric annotation such as BioMart. I use org.Hs.eg.db regularly myself, but that is because I use an NCBI GTF.
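As one possible Ensembl-centric route that needs no cross-database mapping, gene symbols can be read from the attribute column of the Ensembl GTF itself, falling back to the Ensembl ID where a gene has no gene_name attribute. This is a base-R sketch rather than anything ArchR or the answer above prescribes: gtf_attr is a hypothetical helper, and the attribute strings are shortened illustrations (the second gene ID is made up).

```r
# Extract one attribute (e.g. gene_name) from GTF column-9 strings.
# Returns NA where the attribute is absent.
gtf_attr <- function(attr_strings, key) {
  pattern <- paste0("(^|;)[[:space:]]*", key, " \"([^\"]*)\"")
  m <- regmatches(attr_strings, regexec(pattern, attr_strings))
  vapply(
    m,
    function(x) if (length(x) >= 3) x[3] else NA_character_,
    character(1),
    USE.NAMES = FALSE
  )
}

# Illustrative attribute strings: a named gene and an unnamed one.
attrs <- c(
  'gene_id "ENSG00000141510"; gene_name "TP53"; gene_biotype "protein_coding";',
  'gene_id "ENSG00000000001"; gene_biotype "lncRNA";'
)

ids     <- gtf_attr(attrs, "gene_id")
symbols <- gtf_attr(attrs, "gene_name")

# Use the symbol where there is one, otherwise the Ensembl ID.
labels <- ifelse(is.na(symbols), ids, symbols)
print(data.frame(gene_id = ids, symbol = symbols, label = labels))
```

Because the symbols come from the same release as the gene models, every gene in the GTF gets a label, and unnamed genes are labeled by their Ensembl ID rather than by an NA_ prefix.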


The Ensembl annotation you seem to be using (http://ftp.ensembl.org/pub/release-103/gtf/homo_sapiens/) dates from December 2020, so the org.Hs.eg.db package is actually more up-to-date rather than the other way around. The Ensembl GTF does not contain Entrez IDs, only Ensembl IDs.

@james-w-macdonald-5106
Last seen 15 hours ago
United States

It's not clear from your post what you are missing. There aren't any NCBI Gene IDs there, just some HGNC gene symbols, some of which are replaced with Ensembl Gene IDs with an NA_ prepended.

And the first one is some random lncRNA, which won't have a name anyway.