Question

errors in annotation

Josie • 0

Hello, I found 6 Entrez IDs that return the wrong gene symbol and/or an incorrect gene name from the org.Bt.eg.db database:

| ENTREZ | Gene name | Gene symbol | Correct value |
|---|---|---|---|
| 281667 | cyclin A2 | CC2 | CCNA2 |
| 505987 | glycerol kise | GK | glycerol kinase |
| 508132 | N-acetylglucosamine kise | GK | N-acetylglucosamine kinase; NAGK |
| 511042 | nicotimide nucleotide adenylyltransferase 2 | NMT2 | NMNAT2 |
| 786356 | nos C2HC-type zinc finger 1 | NOS1 | nanos C2HC-type zinc finger 1; NANOS1 |
| 515499 | proliferating cell nuclear antigen | PC | PCNA |

code used was:


library(mygene)
genes <- queryMany(row.names(results), scopes = "entrezgene", fields = c("symbol", "name", "entrezgene"), species = "9913")
MapGenes <- data.frame(genes)
MapResults <- merge(results, MapGenes, by.x = "row.names", by.y = "query")
dim(MapResults)

session info:

```
R version 4.2.1 (2022-06-23)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 11.6.8

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base

other attached packages:
[1] DESeq2_1.36.0 SummarizedExperiment_1.26.1 Biobase_2.56.0
[4] MatrixGenerics_1.8.1 matrixStats_0.62.0 GenomicRanges_1.48.0
[7] GenomeInfoDb_1.32.3 IRanges_2.30.0 S4Vectors_0.34.0
[10] BiocGenerics_0.42.0 edgeR_3.38.4 limma_3.52.2

loaded via a namespace (and not attached):
[1] fgsea_1.22.0 colorspace_2.0-3 ggtree_3.4.1
[4] qvalue_2.28.0 XVector_0.36.0 aplot_0.1.6
[7] rstudioapi_0.13 farver_2.1.1 graphlayouts_0.8.1
[10] ggrepel_0.9.1 bit64_4.0.5 AnnotationDbi_1.58.0
[13] fansi_1.0.3 scatterpie_0.1.7 codetools_0.2-18
[16] splines_4.2.1 cachem_1.0.6 GOSemSim_2.22.0
[19] geneplotter_1.74.0 polyclip_1.10-0 jsonlite_1.8.0
[22] annotate_1.74.0 GO.db_3.15.0 png_0.1-7
[25] ggforce_0.3.3 BiocManager_1.30.18 compiler_4.2.1
[28] httr_1.4.3 assertthat_0.2.1 Matrix_1.4-1
[31] fastmap_1.1.0 lazyeval_0.2.2 cli_3.3.0
[34] tweenr_1.0.2 tools_4.2.1 igraph_1.3.4
[37] gtable_0.3.0 glue_1.6.2 GenomeInfoDbData_1.2.8
[40] reshape2_1.4.4 DO.db_2.9 dplyr_1.0.9
[43] fastmatch_1.1-3 Rcpp_1.0.9 enrichplot_1.16.1
[46] vctrs_0.4.1 Biostrings_2.64.0 ape_5.6-2
[49] nlme_3.1-157 ggraph_2.0.6 stringr_1.4.0
[52] lifecycle_1.0.1 clusterProfiler_4.4.4 XML_3.99-0.10
[55] DOSE_3.22.0 zlibbioc_1.42.0 MASS_7.3-57
[58] scales_1.2.0 tidygraph_1.2.1 parallel_4.2.1
[61] RColorBrewer_1.1-3 memoise_2.0.1 gridExtra_2.3
[64] ggplot2_3.3.6 downloader_0.4 ggfun_0.0.6
[67] yulab.utils_0.0.5 stringi_1.7.8 RSQLite_2.2.15
[70] genefilter_1.78.0 tidytree_0.3.9 BiocParallel_1.30.3
[73] rlang_1.0.4 pkgconfig_2.0.3 bitops_1.0-7
[76] lattice_0.20-45 purrr_0.3.4 treeio_1.20.1
[79] patchwork_1.1.1 shadowtext_0.1.2 bit_4.0.4
[82] tidyselect_1.1.2 plyr_1.8.7 magrittr_2.0.3
[85] R6_2.5.1 generics_0.1.3 DelayedArray_0.22.0
[88] DBI_1.1.3 pillar_1.8.0 survival_3.3-1
[91] KEGGREST_1.36.3 RCurl_1.98-1.8 tibble_3.1.8
[94] crayon_1.5.1 utf8_1.2.2 viridis_0.6.2
[97] locfit_1.5-9.6 grid_4.2.1 data.table_1.14.2
[100] blob_1.2.3 digest_0.6.29 xtable_1.8-4
[103] tidyr_1.2.0 gridGraphics_0.5-1 munsell_0.5.0
[106] viridisLite_0.4.0 ggplotify_0.1.0
```

Thanks,
Josie

org.Bt.eg.db • 1.8k views

ADD COMMENT • link 3.4 years ago Josie • 0

score 0 · Answer 1 · 2022-08-12

James W. MacDonald 68k

Thanks for pointing that out. However, it is unlikely that this is an error on our part: we simply process the data we get from public sources into a format that we hope is useful for our users, without changing the underlying data. So any errors, typographical or otherwise, almost surely existed in the data we got from NCBI.

In addition, all of these data are pretty fluid. Your first gene (CCNA2) was last updated on 5 Aug, whereas the annotation package you are using is from the April release. NCBI updates those data weekly, so there may have been changes in the intervening period, which would also make the org.Bt.eg.db package appear incorrect.

ADD COMMENT • link 3.4 years ago James W. MacDonald 68k

OK, thanks James. I'll send off an email to GenBank and see if they feel inclined to correct some errors.

ADD REPLY • link 3.4 years ago Josie • 0

If you go to NCBI and the data appear to be correct, then they have evidently already fixed the errors you have found. The data we download for the annotation packages and the data they present on their website come from the same source. What I was saying is that the data in the org.Bt.eg.db package are from April, and you are looking at the data as of August. If there are differences, it is almost surely because NCBI have already corrected the errors.

ADD REPLY • link 3.4 years ago James W. MacDonald 68k

Hello James, I talked with NCBI and they say the error is with Bioconductor. Here's their evidence. First, the NCBI history does not bear out your assumption that the annotations were different in April 2022 vs. now. For example, cyclin A2 had the correct gene symbol CCNA2 in April 2022, and in fact has had that symbol for many years (since 2006).

https://www.ncbi.nlm.nih.gov/nuccore/NM_001075123.1?report=girevhist

Second, the NCBI specialist noted that all of the errors had been created by the deletion of the two letters "NA" in a gene symbol or "na" in a gene name, e.g. CCNA2 became CC2, kinase became kise, NAGK became GK, NMNAT2 became NMT2, nanos became nos, and NANOS1 became NOS1.

So they asked that Bioconductor see if the error was created inadvertently during the processing of their data to create the org.Bt.eg.db package.

Thanks,
Josie

ADD REPLY • link 3.4 years ago Josie • 0

Wait a minute. I took you at your word that there was a problem, without checking myself.

> select(org.Bt.eg.db, c("281667","505987","508132","511042","786356","515499"), c("SYMBOL", "GENENAME"))
'select()' returned 1:1 mapping between keys and columns
  ENTREZID SYMBOL                                      GENENAME
1   281667  CCNA2                                     cyclin A2
2   505987     GK                               glycerol kinase
3   508132   NAGK                    N-acetylglucosamine kinase
4   511042 NMNAT2 nicotinamide nucleotide adenylyltransferase 2
5   786356 NANOS1                 nanos C2HC-type zinc finger 1
6   515499   PCNA            proliferating cell nuclear antigen

> sessionInfo()
R version 4.2.0 (2022-04-22 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] org.Bt.eg.db_3.15.0  AnnotationDbi_1.58.0 IRanges_2.30.0      
[4] S4Vectors_0.34.0     Biobase_2.56.0       BiocGenerics_0.42.0 
[7] BiocManager_1.30.18 

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.9             XVector_0.36.0         zlibbioc_1.42.0       
 [4] bit_4.0.4              R6_2.5.1               rlang_1.0.4           
 [7] fastmap_1.1.0          GenomeInfoDb_1.32.3    blob_1.2.3            
[10] httr_1.4.4             tools_4.2.0            png_0.1-7             
[13] cli_3.3.0              DBI_1.1.3              remotes_2.4.2         
[16] bit64_4.0.5            crayon_1.5.1           GenomeInfoDbData_1.2.8
[19] bitops_1.0-7           vctrs_0.4.1            RCurl_1.98-1.8        
[22] KEGGREST_1.36.3        curl_4.3.2             memoise_2.0.1         
[25] cachem_1.0.6           RSQLite_2.2.16         compiler_4.2.0        
[28] Biostrings_2.64.0      pkgconfig_2.0.3

Seems OK to me? I don't know anything about the mygene package, but so far as I can tell there isn't a problem with the org.Bt.eg.db package.
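A spot check like the `select()` call above can be turned into a systematic comparison between the package output and a results table that has passed through other tools. The following is only a sketch: the two data frames are hard-coded from values quoted in this thread so that it runs without Bioconductor installed, and the names `reference`, `edited`, and `mismatch` are hypothetical. In practice `reference` would come from `select(org.Bt.eg.db, ...)` and `edited` from the file being checked.

```r
# Reference annotation: ENTREZID and SYMBOL as printed by select() above.
ids <- c("281667", "505987", "508132", "511042", "786356", "515499")
reference <- data.frame(
  ENTREZID = ids,
  SYMBOL = c("CCNA2", "GK", "NAGK", "NMNAT2", "NANOS1", "PCNA"),
  stringsAsFactors = FALSE
)

# Table under suspicion: the symbols reported in the original question.
edited <- data.frame(
  ENTREZID = ids,
  SYMBOL = c("CC2", "GK", "GK", "NMT2", "NOS1", "PC"),
  stringsAsFactors = FALSE
)

# Join on Entrez ID and keep the rows where the two symbols disagree.
both <- merge(reference, edited, by = "ENTREZID",
              suffixes = c("_ref", "_edited"))
mismatch <- both[both$SYMBOL_ref != both$SYMBOL_edited, ]
print(mismatch)
```

If `mismatch` is empty, the table agrees with the annotation package. If it is not, the disagreement was introduced somewhere between the annotation source and the file that was inspected.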

ADD REPLY • link 3.4 years ago James W. MacDonald 68k

OMG. You're absolutely correct. I inadvertently created the error when changing all the genes with no name (NA) to blank spaces in Excel. Sorry for the trouble.

Josie
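The mechanism can be reproduced in a few lines of base R. This is a sketch under an assumption that the thread does not confirm in detail: that the spreadsheet step behaved like a case-insensitive find-and-replace of the text "NA", applied inside every cell and not only to cells whose whole content was "NA". The vectors below are illustrative and use the correct values from the `select()` output earlier in the thread.

```r
# Correct symbols from the select() output, plus one gene with no symbol
# (a genuine missing value).
symbols <- c("CCNA2", "GK", "NAGK", "NMNAT2", "NANOS1", "PCNA", NA)

# When a table is written to a file, missing values typically appear as the
# literal text "NA".
as_text <- ifelse(is.na(symbols), "NA", symbols)

# ASSUMPTION: the spreadsheet replace acted on substrings, ignoring case.
unsafe <- gsub("na", "", as_text, ignore.case = TRUE)
print(unsafe)

# Safer: blank a cell only when its entire content is "NA".
safe <- sub("^NA$", "", as_text)
print(safe)

# The same substring replace also reproduces the damaged gene names.
gene_names <- c("glycerol kinase",
                "nicotinamide nucleotide adenylyltransferase 2",
                "nanos C2HC-type zinc finger 1")
damaged_names <- gsub("na", "", gene_names, ignore.case = TRUE)
print(damaged_names)
```

Within R, `symbols[is.na(symbols)] <- ""` sidesteps the problem because it touches only true missing values. In Excel, the "Match entire cell contents" option in Find and Replace has a similar effect.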

ADD REPLY • link 3.4 years ago Josie • 0

Looking closer at mygene, it doesn't appear to use any of the OrgDb packages at all, but instead is hitting the mygene.info REST server. Or at least that's what it says on the tin.