Apparent Ensembl - Entrez gene mis-match in
abf ▴ 30

I need to map mouse Ensembl gene IDs to their corresponding Entrez Gene IDs. While reviewing cases of multi-mapping IDs, I came across a few examples where the wrong Entrez ID is assigned to an Ensembl ID.

For example, there is a mismatch for zinc finger genes.

The following genes are all recognized by NCBI, and in most cases the NCBI gene page recognizes the correct Ensembl annotation:

  • Zfp966 is Entrez ID 667962, which NCBI recognizes as Ensembl ID: ENSMUSG00000089756
  • Zfp968 is Entrez ID 100043914, which NCBI recognizes as Ensembl ID: ENSMUSG00000078898
  • Zfp967 is Entrez ID 100303732, which NCBI recognizes as Ensembl ID: ENSMUSG00000095199

Other examples include:

  • Ccl27a, Entrez ID 20301, which NCBI recognizes as Ensembl ID: ENSMUSG00000073888
  • Ndufb4, Entrez ID 68194, which NCBI recognizes as Ensembl ID: ENSMUSG00000022820

If there is an error in my query (below), please let me know.

## Get unique Ensembl gene IDs from the differential expression analysis
ens_mm_gid <- deg_master %>%
  filter(grepl("ENSMUS", gene_id)) %>%
  pull("gene_id") %>%
  unique()

## Query with Ensembl gene ID keys
ens_mm_entrez <- AnnotationDbi::select(,
  columns = c("ENTREZID", "SYMBOL", "GENENAME"),
  keys = ens_mm_gid,
  keytype = "ENSEMBL"
)

## Evaluate cases where 2 or more Ensembl ID's are assigned the same Entrezid
ens_mm_entrez %>%
    mouse_an %>% select(gene_id, Ens_SYMBOL=SYMBOL),
  filter(! %>%
  group_by(ENTREZID) %>%
  filter(n() > 1) %>%
  ) %>%

sessionInfo( )
R version 4.0.5 (2021-03-31)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
 [1] stats4    grid      parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] openxlsx_4.2.4       dplyr_1.0.7  AnnotationDbi_1.52.0 IRanges_2.24.1       S4Vectors_0.28.1     Biobase_2.50.0       ROntoTools_2.18.0   
 [9] Rgraphviz_2.34.0     KEGGgraph_1.50.0     KEGGREST_1.30.1      boot_1.3-28          graph_1.68.0         BiocGenerics_0.36.1  synapser_0.10.89     edgeR_3.32.1        
[17] limma_3.46.0        

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7            locfit_1.5-9.4        lattice_0.20-44       tidyr_1.1.3           png_0.1-7             Biostrings_2.58.0     assertthat_0.2.1      packrat_0.6.0        
 [9] utf8_1.2.1            R6_2.5.0              RSQLite_2.2.7         httr_1.4.2            pillar_1.6.1          zlibbioc_1.36.0       rlang_0.4.11          rstudioapi_0.13      
[17] blob_1.2.1            RCurl_1.98-1.3        bit_4.0.4             compiler_4.0.5        pkgconfig_2.0.3       pack_0.1-1            tidyselect_1.1.1      tibble_3.1.2         
[25] codetools_0.2-18      XML_3.99-0.6          fansi_0.5.0           crayon_1.4.1          bitops_1.0-7          lifecycle_1.0.0       DBI_1.1.1             magrittr_2.0.1       
[33] zip_2.2.0             cli_3.0.1             stringi_1.7.3         cachem_1.0.5          PythonEmbedInR_0.7.80 XVector_0.30.0        ellipsis_0.3.2        generics_0.1.0       
[41] vctrs_0.3.8           tools_4.0.5           bit64_4.0.5           glue_1.4.2            purrr_0.3.4           fastmap_1.1.0         memoise_2.0.0

I should note that when I query using the gene symbols associated with my differential expression data, I find a 1:1 correspondence between a symbol and an Entrez ID, at least for the set of genes that matter in our study. I was always taught that database accessions are more reliable unique identifiers than gene symbols and should be preferred in bioinformatic analyses. Perhaps this is a position I ought to reconsider?
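
A 1:1 claim like this can be checked directly rather than by eye. The following is a minimal base-R sketch, not the query used in this thread: `sym_map` and `is_one_to_one()` are hypothetical names, and the table holds only the five symbol/Entrez pairs quoted earlier in the post, so it stands in for a real `select()` result keyed on `SYMBOL`.

```r
# Symbol/Entrez pairs as quoted earlier in this thread (illustrative subset,
# standing in for a real select() result keyed on SYMBOL).
sym_map <- data.frame(
  SYMBOL   = c("Zfp966", "Zfp968", "Zfp967", "Ccl27a", "Ndufb4"),
  ENTREZID = c("667962", "100043914", "100303732", "20301", "68194"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE when every key maps to exactly one value and
# every value maps back to exactly one key. Duplicate rows and rows with
# a missing key or value are dropped before checking.
is_one_to_one <- function(keys, values) {
  pairs <- unique(data.frame(k = keys, v = values, stringsAsFactors = FALSE))
  pairs <- pairs[!is.na(pairs$k) & !is.na(pairs$v), ]
  !anyDuplicated(pairs$k) && !anyDuplicated(pairs$v)
}

one_to_one <- is_one_to_one(sym_map$SYMBOL, sym_map$ENTREZID)
print(one_to_one)
```

Running the same helper on the Ensembl-keyed table (keys `ENSEMBL`, values `ENTREZID`) would show whether the accession-based mapping is also 1:1, which makes the two approaches directly comparable.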

> select(, "100043914", "ENSEMBL")
'select()' returned 1:1 mapping between keys and columns
   ENTREZID            ENSEMBL
1 100043914 ENSMUSG00000078898
> select(, paste0("ENSMUSG000000", c(95545,89756,78898,95199)), "ENTREZID","ENSEMBL")
'select()' returned 1:1 mapping between keys and columns
             ENSEMBL  ENTREZID
1 ENSMUSG00000095545 100043915
2 ENSMUSG00000089756    667962
3 ENSMUSG00000078898 100043914
4 ENSMUSG00000095199 100303732

> library(biomaRt)
> mart <- useEnsembl("ensembl","mmusculus_gene_ensembl")
> getBM(c("ensembl_gene_id","entrezgene_id"), "ensembl_gene_id", paste0("ENSMUSG000000", c(95545,89756,78898,95199)), mart)
     ensembl_gene_id entrezgene_id
1 ENSMUSG00000078898     100043914
2 ENSMUSG00000089756        667962
3 ENSMUSG00000095199     100303732
4 ENSMUSG00000095545     100043915

You just show some of your code, without really saying where you got e.g., ens_mm_gid, so it's hard to say why you get what you get. Plus I don't speak tidyverse so can't say anything about your code. But I can't get the same results as you, for the first four genes.
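
For anyone wanting to post a reproducible version of the multi-mapping check, here is a minimal base-R sketch of one way to list Entrez IDs that are shared by two or more Ensembl IDs. It assumes a data frame in the shape returned by `select()`; `map`, `mapped`, `n_ens`, `shared` and `multi` are hypothetical names. The first four rows are copied from the `select()` output above, and the fifth row is invented purely to create a many-to-one case.

```r
# Toy mapping table in the shape returned by AnnotationDbi::select().
# Rows 1-4 are copied from the select() output above; row 5 is hypothetical
# and exists only so that one Entrez ID is shared by two Ensembl IDs.
map <- data.frame(
  ENSEMBL  = c("ENSMUSG00000095545", "ENSMUSG00000089756",
               "ENSMUSG00000078898", "ENSMUSG00000095199",
               "ENSMUSG_HYPOTHETICAL"),
  ENTREZID = c("100043915", "667962", "100043914", "100303732", "667962"),
  stringsAsFactors = FALSE
)

# Drop unmapped keys and exact duplicate rows.
mapped <- unique(map[!is.na(map$ENTREZID), ])

# Count distinct Ensembl IDs per Entrez ID.
n_ens <- tapply(mapped$ENSEMBL, mapped$ENTREZID,
                function(x) length(unique(x)))

# Keep only the Entrez IDs claimed by two or more Ensembl IDs.
shared <- names(n_ens)[n_ens > 1]
multi  <- mapped[mapped$ENTREZID %in% shared, ]
multi  <- multi[order(multi$ENTREZID, multi$ENSEMBL), ]
print(multi)
```

Posting a table like `multi` along with the exact keys that produced it would make it much easier for others to compare results.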

That said, it's my (oft repeated) contention that mapping between the two annotation sources is mostly unnecessary, and should be avoided if possible. There are any number of subtle reasons why the two services disagree, and they have yet to come to a good agreement for human genes, let alone mouse genes. Trying to naively map IDs between the two when the experts for each service can't agree seems like a fraught exercise, and I have never really seen what one might gain by doing so.

