Hello, I found 6 Entrez IDs that return the wrong gene symbol and/or an error in the gene name from the org.Bt.eg.db database.

| ENTREZ | Gene name | Gene symbol |
|---|---|---|
| 281667 | cyclin A2 | CC2 (it's CCNA2) |
| 505987 | glycerol kise (it's glycerol kinase) | GK |
| 508132 | N-acetylglucosamine kise (it's N-acetylglucosamine kinase) | GK (it's NAGK) |
| 511042 | nicotimide nucleotide adenylyltransferase 2 | NMT2 (it's NMNAT2) |
| 786356 | nos C2HC-type zinc finger 1 (it's nanos C2HC-type zinc finger 1) | NOS1 (it's NANOS1) |
| 515499 | proliferating cell nuclear antigen | PC (it's PCNA) |
Code used was:

```
library(mygene)
genes <- queryMany(row.names(results), scopes = "entrezgene", fields = c("symbol", "name", "entrezgene"), species = "9913")
MapGenes <- data.frame(genes)
MapResults <- merge(results, MapGenes, by.x = "row.names", by.y = "query")
dim(MapResults)
```
Session info:

```
R version 4.2.1 (2022-06-23)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 11.6.8

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base

other attached packages:
[1] DESeq2_1.36.0 SummarizedExperiment_1.26.1 Biobase_2.56.0
[4] MatrixGenerics_1.8.1 matrixStats_0.62.0 GenomicRanges_1.48.0
[7] GenomeInfoDb_1.32.3 IRanges_2.30.0 S4Vectors_0.34.0
[10] BiocGenerics_0.42.0 edgeR_3.38.4 limma_3.52.2
loaded via a namespace (and not attached):
[1] fgsea_1.22.0 colorspace_2.0-3 ggtree_3.4.1
[4] qvalue_2.28.0 XVector_0.36.0 aplot_0.1.6
[7] rstudioapi_0.13 farver_2.1.1 graphlayouts_0.8.1
[10] ggrepel_0.9.1 bit64_4.0.5 AnnotationDbi_1.58.0
[13] fansi_1.0.3 scatterpie_0.1.7 codetools_0.2-18
[16] splines_4.2.1 cachem_1.0.6 GOSemSim_2.22.0
[19] geneplotter_1.74.0 polyclip_1.10-0 jsonlite_1.8.0
[22] annotate_1.74.0 GO.db_3.15.0 png_0.1-7
[25] ggforce_0.3.3 BiocManager_1.30.18 compiler_4.2.1
[28] httr_1.4.3 assertthat_0.2.1 Matrix_1.4-1
[31] fastmap_1.1.0 lazyeval_0.2.2 cli_3.3.0
[34] tweenr_1.0.2 tools_4.2.1 igraph_1.3.4
[37] gtable_0.3.0 glue_1.6.2 GenomeInfoDbData_1.2.8
[40] reshape2_1.4.4 DO.db_2.9 dplyr_1.0.9
[43] fastmatch_1.1-3 Rcpp_1.0.9 enrichplot_1.16.1
[46] vctrs_0.4.1 Biostrings_2.64.0 ape_5.6-2
[49] nlme_3.1-157 ggraph_2.0.6 stringr_1.4.0
[52] lifecycle_1.0.1 clusterProfiler_4.4.4 XML_3.99-0.10
[55] DOSE_3.22.0 zlibbioc_1.42.0 MASS_7.3-57
[58] scales_1.2.0 tidygraph_1.2.1 parallel_4.2.1
[61] RColorBrewer_1.1-3 memoise_2.0.1 gridExtra_2.3
[64] ggplot2_3.3.6 downloader_0.4 ggfun_0.0.6
[67] yulab.utils_0.0.5 stringi_1.7.8 RSQLite_2.2.15
[70] genefilter_1.78.0 tidytree_0.3.9 BiocParallel_1.30.3
[73] rlang_1.0.4 pkgconfig_2.0.3 bitops_1.0-7
[76] lattice_0.20-45 purrr_0.3.4 treeio_1.20.1
[79] patchwork_1.1.1 shadowtext_0.1.2 bit_4.0.4
[82] tidyselect_1.1.2 plyr_1.8.7 magrittr_2.0.3
[85] R6_2.5.1 generics_0.1.3 DelayedArray_0.22.0
[88] DBI_1.1.3 pillar_1.8.0 survival_3.3-1
[91] KEGGREST_1.36.3 RCurl_1.98-1.8 tibble_3.1.8
[94] crayon_1.5.1 utf8_1.2.2 viridis_0.6.2
[97] locfit_1.5-9.6 grid_4.2.1 data.table_1.14.2
[100] blob_1.2.3 digest_0.6.29 xtable_1.8-4
[103] tidyr_1.2.0 gridGraphics_0.5-1 munsell_0.5.0
[106] viridisLite_0.4.0 ggplotify_0.1.0
```
Thanks Josie
OK thanks James. I'll send off an email to GenBank, see if they feel inclined to correct some errors.
If you go to NCBI and the data appear to be correct, then they have evidently already fixed the errors you have found. The data we download for the annotation packages and the data they present on their website come from the same source. What I was saying is that the data in the `org.Bt.eg.db` package are from April, and you are looking at the data as of August. If there are differences, it is almost surely because NCBI have already corrected the errors.

Hello James, I talked with NCBI and they say the error is with Bioconductor. Here's their evidence. First, the NCBI history does not bear out your assumption that the annotations were different in April 2022 vs now. For example, cyclin A2, CCNA2, had the correct gene symbol CCNA2 in April 2022, and in fact has had that symbol for many years (since 2006).
https://www.ncbi.nlm.nih.gov/nuccore/NM_001075123.1?report=girevhist
Second, the NCBI specialist noted that all of the errors had been created by the deletion of the two letters "NA" in a gene symbol or "na" in a gene name. For example, CCNA2 became CC2, kinase became kise, NAGK became GK, NMNAT2 became NMT2, nanos became nos, and NANOS1 became NOS1.
So they asked that Bioconductor see if the error was created inadvertently during the processing of their data to create the org.Bt.eg.db package.
Thanks Josie
Wait a minute. I took you at your word that there was a problem, without checking myself.
Seems OK to me? I don't know anything about the `mygene` package, but so far as I can tell there isn't a problem with the `org.Bt.eg.db` package.

OMG. You're absolutely correct. I inadvertently created the error when changing all the genes with no name (NA) to blank spaces in Excel. Sorry for the trouble.
Josie
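For anyone who runs into the same thing, here is a minimal sketch of how a blanket find-and-replace of "NA" produces exactly these symptoms, and one way to blank out only the genuinely missing values in R before exporting. The vectors and variable names are made up for illustration (the symbols and names are taken from this thread), and the case-insensitive replace is an assumption about how the spreadsheet replace was configured.

```r
# Illustrative data: symbols and names mentioned in this thread,
# plus one genuinely missing symbol (NA).
symbols <- c("CCNA2", "NAGK", "NMNAT2", "NANOS1", "PCNA", NA)
gene_names <- c("glycerol kinase",
                "nanos C2HC-type zinc finger 1",
                "nicotinamide nucleotide adenylyltransferase 2")

# Once exported to a spreadsheet, a missing value is just the text "NA".
exported <- ifelse(is.na(symbols), "NA", symbols)

# The mistake: replacing every occurrence of the text "NA" with nothing.
broken_symbols <- gsub("NA", "", exported, fixed = TRUE)
print(broken_symbols)
# "CC2" "GK" "NMT2" "NOS1" "PC" ""

# Assuming the replace was case-insensitive, gene names are hit as well.
broken_names <- gsub("na", "", gene_names, ignore.case = TRUE)
print(broken_names)
# "glycerol kise" "nos C2HC-type zinc finger 1"
# "nicotimide nucleotide adenylyltransferase 2"

# Safer: blank out only the elements that are actually missing.
safe_symbols <- symbols
safe_symbols[is.na(safe_symbols)] <- ""
print(safe_symbols)
# "CCNA2" "NAGK" "NMNAT2" "NANOS1" "PCNA" ""
```

Doing the replacement on `is.na()` in R, rather than on the text "NA" in a spreadsheet, leaves every real symbol and name untouched.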
Looking closer at `mygene`, it doesn't appear to use any of the `OrgDb` packages at all, but instead is hitting the mygene.info REST server. Or at least that's what it says on the tin.

And there's obviously nothing wrong with it either. Thanks again for your help.