Gene conversion from SYMBOL to ENTREZ using AnnotationDbi doesn't convert some of the genes
Emilia ▴ 30
Last seen 2.3 years ago
Argentina, Rosario (Universidad Naciona…

Hello! I'm trying to convert a set of 6422 gene symbols into Entrez IDs using AnnotationDbi. (I got this list of genes from a differential expression analysis with TCGAbiolinks on a TCGA hepatocellular carcinoma dataset.) When I do this, 838 genes return NA for ENTREZID. However, some of them do have an Entrez ID associated with the name provided in the original dataset. For example, one of my genes of interest is SNAI2, whose Entrez ID is 6591, yet it was still among the NAs.

Any ideas on why this could be happening? I'm new to Bioconductor packages, so I'm sorry if this is too dumb!

#the data frame with my differentially expressed genes is called DEedgeR
Gene_names <- DEedgeR$Genes
Genes2 <- select(, Genes_names, 'ENTREZID', 'SYMBOL')

sessionInfo( )
R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

[1] LC_COLLATE=Spanish_Argentina.1252  LC_CTYPE=Spanish_Argentina.1252   
[3] LC_MONETARY=Spanish_Argentina.1252 LC_NUMERIC=C                      
[5] LC_TIME=Spanish_Argentina.1252    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] pathview_1.30.0        DO.db_2.9              KEGG.db_3.2.4          KEGGprofile_1.32.0    
 [5]    AnnotationDbi_1.52.0   IRanges_2.24.0         S4Vectors_0.28.0      
 [9] Biobase_2.50.0         BiocGenerics_0.36.0    clusterProfiler_3.18.0 TCGAbiolinks_2.18.0   
[13] BiocManager_1.30.10    ggthemes_4.2.0         survival_3.2-7         survminer_0.4.8       
[17] ggpubr_0.4.0           ggplot2_3.3.2          xlsx_0.6.5             tidyr_1.1.2           
[21] readxl_1.3.1           dplyr_1.0.2           

loaded via a namespace (and not attached):
  [1] shadowtext_0.0.7            backports_1.2.0             fastmatch_1.1-0            
  [4] BiocFileCache_1.14.0        plyr_1.8.6                  igraph_1.2.6               
  [7] splines_4.0.3               BiocParallel_1.24.1         GenomeInfoDb_1.26.1        
 [10] digest_0.6.27               GOSemSim_2.16.1             viridis_0.5.1              
 [13] GO.db_3.12.1                fansi_0.4.1                 magrittr_2.0.1             
 [16] memoise_1.1.0               openxlsx_4.2.3              Biostrings_2.58.0          
 [19] readr_1.4.0                 graphlayouts_0.7.1          matrixStats_0.57.0         
 [22] R.utils_2.10.1              askpass_1.1                 enrichplot_1.10.1          
 [25] prettyunits_1.1.1           colorspace_2.0-0            blob_1.2.1                 
 [28] rvest_0.3.6                 rappdirs_0.3.1              ggrepel_0.8.2              
 [31] haven_2.3.1                 xfun_0.19                   crayon_1.3.4               
 [34] RCurl_1.98-1.2              jsonlite_1.7.1              graph_1.68.0               
 [37] scatterpie_0.1.5            zoo_1.8-8                   glue_1.4.2                 
 [40] polyclip_1.10-0             gtable_0.3.0                zlibbioc_1.36.0            
 [43] XVector_0.30.0              DelayedArray_0.16.0         car_3.0-10                 
 [46] Rgraphviz_2.34.0            abind_1.4-5                 scales_1.1.1               
 [49] DOSE_3.16.0                 DBI_1.1.0                   rstatix_0.6.0              
 [52] Rcpp_1.0.5                  viridisLite_0.3.0           xtable_1.8-4               
 [55] progress_1.2.2              foreign_0.8-80              bit_4.0.4                  
 [58] km.ci_0.5-2                 httr_1.4.2                  fgsea_1.16.0               
 [61] RColorBrewer_1.1-2          ellipsis_0.3.1              pkgconfig_2.0.3            
 [64] XML_3.99-0.5                rJava_0.9-13                R.methodsS3_1.8.1          
 [67] farver_2.0.3                dbplyr_2.0.0                tidyselect_1.1.0           
 [70] rlang_0.4.8                 reshape2_1.4.4              TeachingDemos_2.12         
 [73] munsell_0.5.0               cellranger_1.1.0            tools_4.0.3                
 [76] cli_2.2.0                   downloader_0.4              generics_0.1.0             
 [79] RSQLite_2.2.1               broom_0.7.2                 stringr_1.4.0              
 [82] knitr_1.30                  bit64_4.0.5                 tidygraph_1.2.0            
 [85] zip_2.1.1                   survMisc_0.5.5              purrr_0.3.4                
 [88] KEGGREST_1.30.1             ggraph_2.0.4                R.oo_1.24.0                
 [91] KEGGgraph_1.50.0            xml2_1.3.2                  biomaRt_2.46.0             
 [94] compiler_4.0.3              rstudioapi_0.13             png_0.1-7                  
 [97] curl_4.3                    ggsignif_0.6.0              tibble_3.0.4               
[100] tweenr_1.0.1                stringi_1.5.3               TCGAbiolinksGUI.data_1.10.0
[103] forcats_0.5.0               lattice_0.20-41             Matrix_1.2-18              
[106] KMsurv_0.1-5                vctrs_0.3.5                 pillar_1.4.7               
[109] lifecycle_0.2.0             data.table_1.13.2           cowplot_1.1.0              
[112] bitops_1.0-6                GenomicRanges_1.42.0        qvalue_2.22.0              
[115] R6_2.5.0                    gridExtra_2.3               rio_0.5.16                 
[118] MASS_7.3-53                 assertthat_0.2.1            SummarizedExperiment_1.20.0
[121] xlsxjars_0.6.1              openssl_1.4.3               withr_2.3.0                
[124] GenomeInfoDbData_1.2.4      hms_0.5.3                   grid_4.0.3                 
[127] rvcheck_0.1.8               MatrixGenerics_1.2.0        carData_3.0-4              
[130] ggforce_0.3.2               tinytex_0.27               
AnnotationDbi • 1.6k views
Last seen 9 hours ago
Republic of Ireland

It is expected that some gene symbols will not have a corresponding Entrez, Ensembl, or other gene ID.

Just to be sure, your code is incorrect and should be:

Gene_names <- DEedgeR$Genes
Genes2 <- select(, keys = Gene_names,
  columns = c('ENTREZID'), keytype = 'SYMBOL')

Your code uses 'Gene_names' on the first line but 'Genes_names' on the second. Also, I would get into the habit of writing the parameter names so that you can be confident you are using the function correctly.

If these discrepancies that I mention above are not the actual problem that you are facing, then please provide some example gene symbols that are not matching.
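
A minimal sketch of one way to collect such non-matching symbols from the result, assuming it has the SYMBOL and ENTREZID columns that the corrected call above returns. The helper name `unmapped_symbols`, the toy data frame, and the key "NOT_A_GENE" are made up for illustration; in real use you would pass Genes2 instead.

```r
# Toy stand-in for the data frame returned by select(); "NOT_A_GENE" is a
# made-up key used only to produce an NA row for this illustration.
Genes2_toy <- data.frame(
  SYMBOL   = c("SNAI2", "NOT_A_GENE"),
  ENTREZID = c("6591", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the symbols whose ENTREZID came back as NA.
unmapped_symbols <- function(map_df) {
  unique(map_df$SYMBOL[is.na(map_df$ENTREZID)])
}

unmapped <- unmapped_symbols(Genes2_toy)
length(unmapped)  # how many symbols failed to map
head(unmapped)    # a few examples to post or inspect
```

Posting a handful of the symbols from `head(unmapped)` makes it much easier for others to see whether the misses are expected or point to a problem in the call.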



I imagined that some genes didn't have a corresponding Entrez ID, but I knew this one in particular did because it's the one I work with. I tried your code and it worked, so thank you very much! I've only been working with Bioconductor packages for a few days, so this is really helpful for me.

Last seen 13 hours ago
United States
> args(select)
function (x, keys, columns, keytype, ...) 

## you did, in essence
> select(, "6591", "ENTREZID","SYMBOL")
Error in .testForValidKeys(x, keys, keytype, fks) : 
  None of the keys entered are valid keys for 'SYMBOL'. Please use the keys method to see a listing of valid arguments.

## which makes no sense, given the argument order
> select(, "6591", "SYMBOL", "ENTREZID")
'select()' returned 1:1 mapping between keys and columns
  ENTREZID SYMBOL
1     6591  SNAI2

The help pages are your friend; see ?select.

