Question

Gene conversion from SYMBOL to ENTREZ using AnnotationDbi doesn't convert some of the genes

Emilia ▴ 30

@emiliabaffo

Last seen 2.9 years ago

Argentina, Rosario (Universidad Naciona…

Hello! I'm trying to convert a set of 6422 gene symbols into Entrez IDs with AnnotationDbi. I got this list of genes from a differential expression analysis run with TCGAbiolinks on a TCGA hepatocellular carcinoma dataset. When I do the conversion, 838 genes return NA for ENTREZID. However, some of them do have an Entrez ID associated with the symbol given in the original dataset (e.g. one of my genes of interest, SNAI2, has Entrez ID 6591 but was still among the NAs).

Any ideas on why this could be happening? I'm new to Bioconductor packages, so I'm sorry if this is too dumb!

#the data frame with my differentially expressed genes is called DEedgeR
Gene_names <- DEedgeR$Genes
Genes2 <- select(org.Hs.eg.db, Genes_names, 'ENTREZID', 'SYMBOL')

sessionInfo( )
R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=Spanish_Argentina.1252  LC_CTYPE=Spanish_Argentina.1252   
[3] LC_MONETARY=Spanish_Argentina.1252 LC_NUMERIC=C                      
[5] LC_TIME=Spanish_Argentina.1252    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] pathview_1.30.0        DO.db_2.9              KEGG.db_3.2.4          KEGGprofile_1.32.0    
 [5] org.Hs.eg.db_3.12.0    AnnotationDbi_1.52.0   IRanges_2.24.0         S4Vectors_0.28.0      
 [9] Biobase_2.50.0         BiocGenerics_0.36.0    clusterProfiler_3.18.0 TCGAbiolinks_2.18.0   
[13] BiocManager_1.30.10    ggthemes_4.2.0         survival_3.2-7         survminer_0.4.8       
[17] ggpubr_0.4.0           ggplot2_3.3.2          xlsx_0.6.5             tidyr_1.1.2           
[21] readxl_1.3.1           dplyr_1.0.2           

loaded via a namespace (and not attached):
  [1] shadowtext_0.0.7            backports_1.2.0             fastmatch_1.1-0            
  [4] BiocFileCache_1.14.0        plyr_1.8.6                  igraph_1.2.6               
  [7] splines_4.0.3               BiocParallel_1.24.1         GenomeInfoDb_1.26.1        
 [10] digest_0.6.27               GOSemSim_2.16.1             viridis_0.5.1              
 [13] GO.db_3.12.1                fansi_0.4.1                 magrittr_2.0.1             
 [16] memoise_1.1.0               openxlsx_4.2.3              Biostrings_2.58.0          
 [19] readr_1.4.0                 graphlayouts_0.7.1          matrixStats_0.57.0         
 [22] R.utils_2.10.1              askpass_1.1                 enrichplot_1.10.1          
 [25] prettyunits_1.1.1           colorspace_2.0-0            blob_1.2.1                 
 [28] rvest_0.3.6                 rappdirs_0.3.1              ggrepel_0.8.2              
 [31] haven_2.3.1                 xfun_0.19                   crayon_1.3.4               
 [34] RCurl_1.98-1.2              jsonlite_1.7.1              graph_1.68.0               
 [37] scatterpie_0.1.5            zoo_1.8-8                   glue_1.4.2                 
 [40] polyclip_1.10-0             gtable_0.3.0                zlibbioc_1.36.0            
 [43] XVector_0.30.0              DelayedArray_0.16.0         car_3.0-10                 
 [46] Rgraphviz_2.34.0            abind_1.4-5                 scales_1.1.1               
 [49] DOSE_3.16.0                 DBI_1.1.0                   rstatix_0.6.0              
 [52] Rcpp_1.0.5                  viridisLite_0.3.0           xtable_1.8-4               
 [55] progress_1.2.2              foreign_0.8-80              bit_4.0.4                  
 [58] km.ci_0.5-2                 httr_1.4.2                  fgsea_1.16.0               
 [61] RColorBrewer_1.1-2          ellipsis_0.3.1              pkgconfig_2.0.3            
 [64] XML_3.99-0.5                rJava_0.9-13                R.methodsS3_1.8.1          
 [67] farver_2.0.3                dbplyr_2.0.0                tidyselect_1.1.0           
 [70] rlang_0.4.8                 reshape2_1.4.4              TeachingDemos_2.12         
 [73] munsell_0.5.0               cellranger_1.1.0            tools_4.0.3                
 [76] cli_2.2.0                   downloader_0.4              generics_0.1.0             
 [79] RSQLite_2.2.1               broom_0.7.2                 stringr_1.4.0              
 [82] knitr_1.30                  bit64_4.0.5                 tidygraph_1.2.0            
 [85] zip_2.1.1                   survMisc_0.5.5              purrr_0.3.4                
 [88] KEGGREST_1.30.1             ggraph_2.0.4                R.oo_1.24.0                
 [91] KEGGgraph_1.50.0            xml2_1.3.2                  biomaRt_2.46.0             
 [94] compiler_4.0.3              rstudioapi_0.13             png_0.1-7                  
 [97] curl_4.3                    ggsignif_0.6.0              tibble_3.0.4               
[100] tweenr_1.0.1                stringi_1.5.3               TCGAbiolinksGUI.data_1.10.0
[103] forcats_0.5.0               lattice_0.20-41             Matrix_1.2-18              
[106] KMsurv_0.1-5                vctrs_0.3.5                 pillar_1.4.7               
[109] lifecycle_0.2.0             data.table_1.13.2           cowplot_1.1.0              
[112] bitops_1.0-6                GenomicRanges_1.42.0        qvalue_2.22.0              
[115] R6_2.5.0                    gridExtra_2.3               rio_0.5.16                 
[118] MASS_7.3-53                 assertthat_0.2.1            SummarizedExperiment_1.20.0
[121] xlsxjars_0.6.1              openssl_1.4.3               withr_2.3.0                
[124] GenomeInfoDbData_1.2.4      hms_0.5.3                   grid_4.0.3                 
[127] rvcheck_0.1.8               MatrixGenerics_1.2.0        carData_3.0-4              
[130] ggforce_0.3.2               tinytex_0.27               
>

AnnotationDbi org.Hs.eg.db • 2.3k views

ADD COMMENT • link 3.4 years ago Emilia ▴ 30

score 2 · Answer 1 · 2020-11-30

Kevin Blighe ★ 3.9k

@kevin

Last seen 9 hours ago

Republic of Ireland

It is expected that some gene symbols will not have a corresponding Entrez, Ensembl, or other gene ID.

Just to be sure, your code is incorrect and should be:

Gene_names <- DEedgeR$Genes
Genes2 <- select(org.Hs.eg.db, keys = Gene_names,
  columns = c('ENTREZID'), keytype = 'SYMBOL')

The variable is called 'Gene_names' on the first line but 'Genes_names' on the second. Also, I would get into the habit of writing the parameter names so that you can be 100% confident that each value is passed to the parameter you intend.

If these discrepancies that I mention above are not the actual problem that you are facing, then please provide some example gene symbols that are not matching.
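One way to collect the symbols that do not match is sketched below. This is only an illustration: the vector `gene_symbols` and the string `NOT_A_REAL_GENE` are made up, and only SNAI2 with Entrez ID 6591 comes from this thread.

```r
library(org.Hs.eg.db)  # attaches AnnotationDbi as a dependency

# Hypothetical input: two real symbols plus one made-up string
gene_symbols <- c("SNAI2", "TP53", "NOT_A_REAL_GENE")

# Named arguments make the keys/columns/keytype roles explicit.
# The AnnotationDbi:: prefix guards against masking, since dplyr
# (attached in the session above) also exports a select() function.
res <- AnnotationDbi::select(org.Hs.eg.db,
                             keys = gene_symbols,
                             columns = "ENTREZID",
                             keytype = "SYMBOL")

# Keys with no match come back with NA in the ENTREZID column
unmapped <- unique(res$SYMBOL[is.na(res$ENTREZID)])
print(unmapped)
```

If a symbol in `unmapped` is an older or alternative name, it may be worth retrying it with keytype = 'ALIAS', which org.Hs.eg.db also offers. Keep in mind that one alias can point to more than one gene, so those results need checking by hand.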

Kevin

ADD COMMENT • link 3.4 years ago Kevin Blighe ★ 3.9k

I imagined that some gene symbols wouldn't have a corresponding Entrez ID, but I knew this one in particular did because it's the one I work with. I tried your code and it worked, so thank you very much! I've only been working with Bioconductor packages for a few days, so this is really helpful for me.

ADD REPLY • link 3.4 years ago Emilia ▴ 30

score 1 · Answer 2 · 2020-11-30

> args(select)
function (x, keys, columns, keytype, ...) 

## you did, in essence
> select(org.Hs.eg.db, "6591", "ENTREZID","SYMBOL")
Error in .testForValidKeys(x, keys, keytype, fks) : 
  None of the keys entered are valid keys for 'SYMBOL'. Please use the keys method to see a listing of valid arguments.

## which makes no sense, given the argument order
> select(org.Hs.eg.db, "6591", "SYMBOL", "ENTREZID")
'select()' returned 1:1 mapping between keys and columns
  ENTREZID SYMBOL
1     6591  SNAI2
>

The help pages are your friend; see ?select.
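As the error message above suggests, the valid key types and keys can be listed before querying. A short sketch follows; the variable names are hypothetical and chosen only for this example.

```r
library(org.Hs.eg.db)  # attaches AnnotationDbi as a dependency

# Values that are accepted for the 'keytype' argument
valid_keytypes <- keytypes(org.Hs.eg.db)
print(valid_keytypes)

# All valid keys of type SYMBOL; peek at the first few
symbol_keys <- keys(org.Hs.eg.db, keytype = "SYMBOL")
print(head(symbol_keys))

# Check a key against the key type it is supposed to belong to
symbol_ok <- "SNAI2" %in% symbol_keys        # a gene symbol
entrez_as_symbol <- "6591" %in% symbol_keys  # an Entrez ID is not a symbol
print(c(symbol_ok = symbol_ok, entrez_as_symbol = entrez_as_symbol))
```

Checking keys this way before calling select() shows quickly whether the keys and the keytype argument actually belong together.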