Differences in ENSEMBL => ENTREZID mapping coverage between org.Hs.eg.db and AnnotationHub resources
Entering edit mode
sandmann.t ▴ 70
Last seen 13 days ago
United States

I would like to create a data.frame with both ENSEMBL and ENTREZID gene identifiers for several species, including the human and the Macaca fascicularis genomes.

I noticed that the latest EnsDb object available (AH100643 = ensemble version 106) for the human genome only contains ENTREZIDs for ~ 40% of the ENSEMBL gene identifiers. (See reproducible example below.) In contrast, the org.Hs.eg.db organism annotation package has ENSEMBL => ENTREZID mappings for ~ 55% of (the same) ENSEMBL genes.

For the human genome, I can chose to use the org.Hs.eg.db package as the source of annotations (because it seems to be more comprehensive). But for Macaca fascicularis, I don't have that option because no annotation package exists in Bioconductor 3.15.

There is the Macaca fascicularis AH100664 resource (ensembl v106) in the AnnotationHub. Yet, based on my experience with the human annotations, I am wondering if I am missing ENSEMBL => ENTREZID mappings by relying on it?

Is there a better way to retrieve the most comprehensive gene identifier mappings for this species? Or perhaps I am not using the EnsDb object from the AnnotationHub provides correctly?

Many thanks for any pointers and suggestions!

Reproducible example


kResource <- "AH100643" # ensembl v106, Homo sapiens
kCacheDir <- getAnnotationHubOption("CACHE")

dir.create(kCacheDir, showWarnings = FALSE, recursive = TRUE)
ah = AnnotationHub(cache = kCacheDir)
# snapshotDate(): 2022-04-21


# AnnotationHub with 1 record
# snapshotDate(): 2022-04-21
# names(): AH100643
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: EnsDb
# $rdatadateadded: 2022-04-21
# $title: Ensembl 106 EnsDb for Homo sapiens
# $description: Gene and protein annotations for Homo sapiens based on Ensembl ...
# $taxonomyid: 9606
# $genome: GRCh38
# $sourcetype: ensembl
# $sourceurl: http://www.ensembl.org
# $sourcesize: NA
# $tags: c("106", "Annotation", "AnnotationHubSoftware", "Coverage",
#   "DataImport", "EnsDb", "Ensembl", "Gene", "Protein", "Sequencing",
#   "Transcript") 
# retrieve record with 'object[["AH100643"]]' 

ensdb <- ah[[kResource]]
gene_ids <- select(ensdb, keys(ensdb), 
                   columns = "ENTREZID",
                   keytype = "GENEID")

# ~ 60% of the ENSEMBL gene identifiers are not mapped to ENTREZIDs
# 27312 42296 

# for example ENSEMBL gene ENSG00000001629 (ANKIB1) without an ENTREZID
gene_ids[gene_ids$GENEID == "ENSG00000001629", ]
# 17 ENSG00000001629       NA

# in contrast, the org.Hs.eg.db Bioconductor annotation package
# maps many more ensembl identifiers to their ENTREZIDs
from_bioc <- select(org.Hs.eg.db, keys(ensdb),
                    columns = "ENTREZID", keytype = "ENSEMBL")
# 39007 30705 

# For example, ENSEMBL gene ENSG00000001629 is mapped in org.Hs.eg.db
select(org.Hs.eg.db, "ENSG00000001629", columns = "ENTREZID",
       keytype = "ENSEMBL")
# 1    54467 ENSG00000001629

Session Info

R version 4.2.1 (2022-06-23)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Monterey 12.5

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] org.Hs.eg.db_3.15.0     ensembldb_2.20.2        AnnotationFilter_1.20.0
 [4] GenomicFeatures_1.48.3  GenomicRanges_1.48.0    GenomeInfoDb_1.32.2    
 [7] AnnotationHub_3.4.0     BiocFileCache_2.4.0     dbplyr_2.2.1           
[10] AnnotationDbi_1.58.0    IRanges_2.30.0          S4Vectors_0.34.0       
[13] Biobase_2.56.0          BiocGenerics_0.42.0    

loaded via a namespace (and not attached):
 [1] MatrixGenerics_1.8.1          httr_1.4.3                   
 [3] bit64_4.0.5                   shiny_1.7.2                  
 [5] assertthat_0.2.1              interactiveDisplayBase_1.34.0
 [7] BiocManager_1.30.18           blob_1.2.3                   
 [9] renv_0.15.5                   GenomeInfoDbData_1.2.8       
[11] Rsamtools_2.12.0              yaml_2.3.5                   
[13] progress_1.2.2                BiocVersion_3.15.2           
[15] lattice_0.20-45               pillar_1.7.0                 
[17] RSQLite_2.2.15                glue_1.6.2                   
[19] digest_0.6.29                 promises_1.2.0.1             
[21] XVector_0.36.0                Matrix_1.4-1                 
[23] htmltools_0.5.2               httpuv_1.6.5                 
[25] XML_3.99-0.10                 pkgconfig_2.0.3              
[27] biomaRt_2.52.0                zlibbioc_1.42.0              
[29] purrr_0.3.4                   xtable_1.8-4                 
[31] later_1.3.0                   BiocParallel_1.30.3          
[33] tibble_3.1.7                  KEGGREST_1.36.2              
[35] generics_0.1.3                ellipsis_0.3.2               
[37] SummarizedExperiment_1.26.1   cachem_1.0.6                 
[39] lazyeval_0.2.2                cli_3.3.0                    
[41] magrittr_2.0.3                crayon_1.5.1                 
[43] mime_0.12                     memoise_2.0.1                
[45] fansi_1.0.3                   xml2_1.3.3                   
[47] tools_4.2.1                   prettyunits_1.1.1            
[49] hms_1.1.1                     matrixStats_0.62.0           
[51] BiocIO_1.6.0                  lifecycle_1.0.1              
[53] stringr_1.4.0                 DelayedArray_0.22.0          
[55] Biostrings_2.64.0             compiler_4.2.1               
[57] rlang_1.0.3                   grid_4.2.1                   
[59] RCurl_1.98-1.7                rstudioapi_0.13              
[61] rjson_0.2.21                  rappdirs_0.3.3               
[63] bitops_1.0-7                  restfulr_0.0.15              
[65] codetools_0.2-18              DBI_1.1.3                    
[67] curl_4.3.2                    R6_2.5.1                     
[69] GenomicAlignments_1.32.0      dplyr_1.0.9                  
[71] rtracklayer_1.56.1            fastmap_1.1.0                
[73] bit_4.0.4                     utf8_1.2.2                   
[75] filelock_1.0.2                ProtGenerics_1.28.0          
[77] stringi_1.7.8                 parallel_4.2.1               
[79] Rcpp_1.0.9                    vctrs_0.4.1                  
[81] png_0.1-7                     tidyselect_1.1.2
AnnotationHub org.Hs.eg.db ensembldb • 101 views
Entering edit mode
Last seen 1 day ago
United States

This is a longstanding issue that has to do with how each annotation service defines what a gene is, where they are, and whether or not a given gene in the same relative position for both services is in fact the same gene. Actually, it's the transcripts, as a gene is a simplification of the underlying transcripts and is in some sense not really a thing.

Anyway, NCBI and EBI have been working on something called MANE, where they are trying to find one (!) transcript per gene that they think is the same thing for both annotation services. It's taken years, which is a testament to how hard it is to do and is also a clear indication of what we really know about the genome and genes and whatnot. You can get the data from either NCBI or EBI - I'll get it from NCBI.

> download.file("https://ftp.ncbi.nlm.nih.gov/refseq/MANE/MANE_human/release_1.0/MANE.GRCh38.v1.0.summary.txt.gz", "MANE.GRCh38.v1.0.summary.txt.gz")
trying URL 'https://ftp.ncbi.nlm.nih.gov/refseq/MANE/MANE_human/release_1.0/MANE.GRCh38.v1.0.summary.txt.gz'
Content type 'application/x-gzip' length 1083306 bytes (1.0 MB)
downloaded 1.0 MB

## can't read in directly because there are embedded nulls
> system("gzip -d MANE.GRCh38.v1.0.summary.txt.gz")
[1] 0
> z <- read.delim("MANE.GRCh38.v1.0.summary.txt")
> head(z)
  X.NCBI_GeneID       Ensembl_Gene   HGNC_ID   symbol                      name
1      GeneID:1 ENSG00000121410.12    HGNC:5     A1BG    alpha-1-B glycoprotein
2      GeneID:2 ENSG00000175899.15    HGNC:7      A2M     alpha-2-macroglobulin
3      GeneID:9 ENSG00000171428.15 HGNC:7645     NAT1     N-acetyltransferase 1
4     GeneID:10  ENSG00000156006.5 HGNC:7646     NAT2     N-acetyltransferase 2
5     GeneID:12 ENSG00000196136.18   HGNC:16 SERPINA3  serpin family A member 3
6     GeneID:13 ENSG00000114771.14   HGNC:17    AADAC arylacetamide deacetylase
   RefSeq_nuc RefSeq_prot        Ensembl_nuc      Ensembl_prot MANE_status
1 NM_130786.4 NP_570602.2  ENST00000263100.8 ENSP00000263100.2 MANE Select
2 NM_000014.6 NP_000005.3 ENST00000318602.12 ENSP00000323929.8 MANE Select
3 NM_000662.8 NP_000653.3  ENST00000307719.9 ENSP00000307218.4 MANE Select
4 NM_000015.3 NP_000006.2  ENST00000286479.4 ENSP00000286479.3 MANE Select
5 NM_001085.5 NP_001076.2  ENST00000393078.5 ENSP00000376793.3 MANE Select
6 NM_001086.3 NP_001077.2 ENST00000232892.12 ENSP00000232892.6 MANE Select
  GRCh38_chr chr_start   chr_end chr_strand
1         19  58345183  58353492          -
2         12   9067708   9115919          -
3          8  18210109  18223689          +
4          8  18391282  18401218          +
5         14  94612391  94624053          +
6          3 151814116 151828488          +

That's probably not complete, but it will most likely be superior to what you get from the OrgDb or EnsDb packages.

Entering edit mode

There are 260 genes that NCBI can't/hasn't yet mapped to Ensembl

> download.file("https://ftp.ncbi.nlm.nih.gov/refseq/MANE/MANE_human/release_1.0/MANE.GRCh38.v1.0.genes_not_in_mane.txt.gz", "MANE.GRCh38.v1.0.genes_not_in_mane.txt.gz")
trying URL 'https://ftp.ncbi.nlm.nih.gov/refseq/MANE/MANE_human/release_1.0/MANE.GRCh38.v1.0.genes_not_in_mane.txt.gz'
Content type 'application/x-gzip' length 3421 bytes
downloaded 3421 bytes

> system("gzip -d MANE.GRCh38.v1.0.genes_not_in_mane.txt.gz")
[1] 0
> zz <- read.delim("MANE.GRCh38.v1.0.genes_not_in_mane.txt")
> dim(zz)
[1] 260   4
> head(zz)
   X.GeneID    HGNC_id gene_symbol                                     status
1        28    HGNC:79         ABO      genome error on assembled chromosomes
2    222611 HGNC:18991      ADGRF2                        pending MANE review
3    246181 HGNC:24056       AKR7L non-coding allele on assembled chromosomes
4 100913187 HGNC:44196  APOBEC3A_B          gene not on assembled chromosomes
5     89839 HGNC:15782   ARHGAP11B                        pending MANE review
6    343578 HGNC:16226    ARHGAP40                        pending MANE review

> table(zz$status)

          gene in false duplication region 
      gene located on mitochondrial genome 
         gene not on assembled chromosomes 
         gene undergoes ribosomal slippage 
     genome error on assembled chromosomes 
non-coding allele on assembled chromosomes 
                       pending MANE review 

Login before adding your answer.

Traffic: 213 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6