While retrieving gene annotation for the human genome with the AnnotationDbi package and the org.Hs.eg.db reference database, I noticed something that looks odd.
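library(org.Hs.eg.db)
library(dplyr)  # provides %>% and filter()

# Retrieve gene name and Entrez ID for every gene symbol in the database
NCBI_gene_annotationDbi <- AnnotationDbi::select(
  org.Hs.eg.db,
  keys = AnnotationDbi::keys(org.Hs.eg.db, keytype = "SYMBOL"),
  columns = c("SYMBOL", "GENENAME", "ENTREZID"),
  keytype = "SYMBOL"
)

# Symbols that appear on more than one row, i.e. map to several Entrez IDs
duplicated_genes <- NCBI_gene_annotationDbi$SYMBOL[duplicated(NCBI_gene_annotationDbi$SYMBOL)]

# Keep all rows belonging to those symbols
NCBI_gene_annotationDbi_duplicated <- NCBI_gene_annotationDbi %>%
  dplyr::filter(SYMBOL %in% duplicated_genes)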
As you can see, only a small fraction of SYMBOL values (8 symbols out of a total of 61538) map to more than one ENTREZID. In most of these cases, one match carries the general description and annotation of the gene, while the other seems superfluous (for instance, HBD has two ENTREZID matches: a general one for hemoglobin subunit delta, and a very specific one that does not describe its function: bone disease).
With so few duplicated entries, would it be possible to clear these seemingly superfluous one-to-many matches, keeping only the most relevant one for each symbol? Or perhaps merge both descriptions?
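To make the two options concrete, here is a minimal sketch of what I have in mind. It runs on a toy data frame shaped like the select() result, not on the real database: the ENTREZID values and the second HBD description are made-up placeholders, and "first row" stands in for "most relevant", which is an assumption on my side.

```r
# Toy stand-in for the select() result; ENTREZID values are placeholders,
# not real Entrez IDs.
annot <- data.frame(
  SYMBOL   = c("HBD", "HBD", "TP53"),
  GENENAME = c("hemoglobin subunit delta", "bone disease entry", "tumor protein p53"),
  ENTREZID = c("1001", "1002", "1003"),
  stringsAsFactors = FALSE
)

# Option 1: keep only the first match for each symbol
# (assumes the first row is the most relevant one).
first_only <- annot[!duplicated(annot$SYMBOL), ]

# Option 2: merge all matches of a symbol into a single row,
# joining the distinct values with "; ".
collapse <- function(x) paste(unique(x), collapse = "; ")
merged <- aggregate(
  annot[c("GENENAME", "ENTREZID")],
  by = list(SYMBOL = annot$SYMBOL),
  FUN = collapse
)
```

As far as I understand, AnnotationDbi::mapIds() with its multiVals argument (default "first") behaves like option 1 on the real database, but it returns one column at a time.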
sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS: /softhpc/R/4.0.2/lib64/R/lib/libRblas.so
LAPACK: /softhpc/R/4.0.2/lib64/R/lib/libRlapack.so

locale:
LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
stats4 parallel stats graphics grDevices utils datasets methods base

other attached packages:
GO.db_3.12.1 hgu133plus2probe_2.18.0 hgu133plus2.db_3.2.3 org.Hs.eg.db_3.12.0
hgu133plus2cdf_2.18.0 bmkanalysis_1.0.0 testthat_3.0.1 affy_1.68.0
EnsDb.Hsapiens.v86_2.99.0 ensembldb_2.14.0 AnnotationFilter_1.14.0 GenomicFeatures_1.42.1
AnnotationDbi_1.52.0 Biobase_2.50.0 GenomicRanges_1.42.0 GenomeInfoDb_1.26.2
IRanges_2.24.1 S4Vectors_0.28.1 BiocGenerics_0.36.0