Gene conversion from SYMBOL to ENTREZ using AnnotationDbi doesn't convert some of the genes
Emilia ▴ 30
@emiliabaffo

Hello! I'm trying to convert a set of 6422 gene symbols into Entrez IDs with AnnotationDbi. (I got this list of genes from a differential expression analysis with TCGAbiolinks on a TCGA hepatocellular carcinoma dataset.) When I do this, 838 genes return NA for ENTREZID. However, some of them do have an Entrez ID associated with the name provided in the original dataset. For example, one of my genes of interest is SNAI2, which has Entrez ID 6591, but it was still among the NAs.

Any ideas on why this could be happening? I'm new to Bioconductor packages, so I'm sorry if this is too dumb!

# the data frame with my differentially expressed genes is called DEedgeR
Gene_names <- DEedgeR$Genes
Genes2 <- select(org.Hs.eg.db, Genes_names, 'ENTREZID', 'SYMBOL')

sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=Spanish_Argentina.1252 LC_CTYPE=Spanish_Argentina.1252
[3] LC_MONETARY=Spanish_Argentina.1252 LC_NUMERIC=C
[5] LC_TIME=Spanish_Argentina.1252

attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base

other attached packages:
[1] pathview_1.30.0 DO.db_2.9 KEGG.db_3.2.4 KEGGprofile_1.32.0
[5] org.Hs.eg.db_3.12.0 AnnotationDbi_1.52.0 IRanges_2.24.0 S4Vectors_0.28.0
[9] Biobase_2.50.0 BiocGenerics_0.36.0 clusterProfiler_3.18.0 TCGAbiolinks_2.18.0
[13] BiocManager_1.30.10 ggthemes_4.2.0 survival_3.2-7 survminer_0.4.8
[17] ggpubr_0.4.0 ggplot2_3.3.2 xlsx_0.6.5 tidyr_1.1.2
[21] readxl_1.3.1 dplyr_1.0.2

loaded via a namespace (and not attached):
[1] shadowtext_0.0.7 backports_1.2.0 fastmatch_1.1-0
[4] BiocFileCache_1.14.0 plyr_1.8.6 igraph_1.2.6
[7] splines_4.0.3 BiocParallel_1.24.1 GenomeInfoDb_1.26.1
[10] digest_0.6.27 GOSemSim_2.16.1 viridis_0.5.1
[13] GO.db_3.12.1 fansi_0.4.1 magrittr_2.0.1
[16] memoise_1.1.0 openxlsx_4.2.3 Biostrings_2.58.0
[19] readr_1.4.0 graphlayouts_0.7.1 matrixStats_0.57.0
[22] R.utils_2.10.1 askpass_1.1 enrichplot_1.10.1
[25] prettyunits_1.1.1 colorspace_2.0-0 blob_1.2.1
[28] rvest_0.3.6 rappdirs_0.3.1 ggrepel_0.8.2
[31] haven_2.3.1 xfun_0.19 crayon_1.3.4
[34] RCurl_1.98-1.2 jsonlite_1.7.1 graph_1.68.0
[37] scatterpie_0.1.5 zoo_1.8-8 glue_1.4.2
[40] polyclip_1.10-0 gtable_0.3.0 zlibbioc_1.36.0
[43] XVector_0.30.0 DelayedArray_0.16.0 car_3.0-10
[46] Rgraphviz_2.34.0 abind_1.4-5 scales_1.1.1
[49] DOSE_3.16.0 DBI_1.1.0 rstatix_0.6.0
[52] Rcpp_1.0.5 viridisLite_0.3.0 xtable_1.8-4
[55] progress_1.2.2 foreign_0.8-80 bit_4.0.4
[58] km.ci_0.5-2 httr_1.4.2 fgsea_1.16.0
[61] RColorBrewer_1.1-2 ellipsis_0.3.1 pkgconfig_2.0.3
[64] XML_3.99-0.5 rJava_0.9-13 R.methodsS3_1.8.1
[67] farver_2.0.3 dbplyr_2.0.0 tidyselect_1.1.0
[70] rlang_0.4.8 reshape2_1.4.4 TeachingDemos_2.12
[73] munsell_0.5.0 cellranger_1.1.0 tools_4.0.3
[76] cli_2.2.0 downloader_0.4 generics_0.1.0
[79] RSQLite_2.2.1 broom_0.7.2 stringr_1.4.0
[82] knitr_1.30 bit64_4.0.5 tidygraph_1.2.0
[85] zip_2.1.1 survMisc_0.5.5 purrr_0.3.4
[88] KEGGREST_1.30.1 ggraph_2.0.4 R.oo_1.24.0
[91] KEGGgraph_1.50.0 xml2_1.3.2 biomaRt_2.46.0
[94] compiler_4.0.3 rstudioapi_0.13 png_0.1-7
[97] curl_4.3 ggsignif_0.6.0 tibble_3.0.4
[100] tweenr_1.0.1 stringi_1.5.3 TCGAbiolinksGUI.data_1.10.0
[103] forcats_0.5.0 lattice_0.20-41 Matrix_1.2-18
[106] KMsurv_0.1-5 vctrs_0.3.5 pillar_1.4.7
[109] lifecycle_0.2.0 data.table_1.13.2 cowplot_1.1.0
[112] bitops_1.0-6 GenomicRanges_1.42.0 qvalue_2.22.0
[115] R6_2.5.0 gridExtra_2.3 rio_0.5.16
[118] MASS_7.3-53 assertthat_0.2.1 SummarizedExperiment_1.20.0
[121] xlsxjars_0.6.1 openssl_1.4.3 withr_2.3.0
[124] GenomeInfoDbData_1.2.4 hms_0.5.3 grid_4.0.3
[127] rvcheck_0.1.8 MatrixGenerics_1.2.0 carData_3.0-4
[130] ggforce_0.3.2 tinytex_0.27

AnnotationDbi org.Hs.eg.db

@kevin
Republic of Ireland

It is expected that some gene symbols will not have a corresponding Entrez, Ensembl, or other gene ID. Just to be sure, your code is incorrect and should be:

Gene_names <- DEedgeR$Genes
Genes2 <- select(org.Hs.eg.db, keys = Gene_names,
  columns = c('ENTREZID'), keytype = 'SYMBOL')


The error is the mismatch between 'Gene_names' (first line) and 'Genes_names' (second line). Also, I would get into the habit of writing the parameter names so that you can have 100% confidence that you are using the function correctly.

If these discrepancies that I mention above are not the actual problem that you are facing, then please provide some example gene symbols that are not matching.
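
One way to collect such examples, as a minimal sketch: pull out the symbols whose ENTREZID is NA in the data frame returned by select(). The helper name `unmatched_symbols` and the toy `Genes2` data frame (including the made-up symbol `NOT_A_GENE`) are assumptions for illustration only; in a real session `Genes2` would come from the select() call on org.Hs.eg.db.

```r
# Hypothetical helper (not part of AnnotationDbi): return the symbols
# whose ENTREZID is NA in a select()-style result.
unmatched_symbols <- function(mapping) {
  unique(mapping$SYMBOL[is.na(mapping$ENTREZID)])
}

# Toy stand-in for the data frame that select() would return. In a real
# session it would come from something like:
#   select(org.Hs.eg.db, keys = Gene_names,
#          columns = "ENTREZID", keytype = "SYMBOL")
Genes2 <- data.frame(
  SYMBOL   = c("SNAI2", "NOT_A_GENE", "TP53"),
  ENTREZID = c("6591", NA, "7157"),
  stringsAsFactors = FALSE
)

# Symbols to report back when asking why a mapping failed
unmatched <- unmatched_symbols(Genes2)
print(unmatched)
```

Posting the first few entries of `unmatched` makes it easier for others to tell whether the symbols are outdated aliases, typos, or genuinely missing from the annotation package.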

Kevin


I imagined that some gene symbols didn't have a corresponding Entrez ID, but I knew this one in particular did because it's the one I work with. I tried your code and it worked, so thank you very much! I've only been working with Bioconductor packages for a few days, so this is really helpful for me.

@james-w-macdonald-5106
United States
> args(select)
function (x, keys, columns, keytype, ...)

## you did, in essence
> select(org.Hs.eg.db, "6591", "ENTREZID","SYMBOL")
Error in .testForValidKeys(x, keys, keytype, fks) :
None of the keys entered are valid keys for 'SYMBOL'. Please use the keys method to see a listing of valid arguments.

## which makes no sense, given the argument order
> select(org.Hs.eg.db, "6591", "SYMBOL", "ENTREZID")
'select()' returned 1:1 mapping between keys and columns
ENTREZID SYMBOL
1     6591  SNAI2
>


The help pages are your friend; see ?select.
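
The positional matching implied by args(select) can be reproduced without any annotation package. This is only a sketch: `show_matching` is a hypothetical toy function that copies the argument order of select() (x, keys, columns, keytype) and simply reports which value ended up in which argument.

```r
# Toy function with the same argument order as select():
# x, keys, columns, keytype. It is a hypothetical stand-in that only
# reports how the supplied values were matched to the arguments.
show_matching <- function(x, keys, columns, keytype) {
  c(keys = keys, columns = columns, keytype = keytype)
}

# Positional call: the third value becomes `columns`, the fourth `keytype`,
# so "6591" is treated as a SYMBOL key here.
positional <- show_matching("db", "6591", "ENTREZID", "SYMBOL")

# Named call: the order of the arguments no longer matters,
# and "6591" is treated as an ENTREZID key.
named <- show_matching("db", keytype = "ENTREZID", columns = "SYMBOL",
                       keys = "6591")

print(positional)
print(named)
```

Naming the arguments, as in the second call, removes any dependence on their order.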