KEGG enrichment in R and gene IDs
Laia

Hi,

I am trying to run a KEGG enrichment analysis on my data. My genes are in SYMBOL, which I converted to ENTREZID, but I need them in "kegg" or "ncbi-geneID" to run enrichKEGG. I looked for packages that would convert these, but they are not updated: MS_convertGene, keggConv, KEGGREST, topGO.

What worked was using the website https://www.genome.jp/kegg/tool/conv_id.html, but I have many gene lists to convert.

Is there any tool that I could use in R to convert these IDs in a more handy way?

Thank you. Laia

Guido Hooiveld

Please note that ENTREZID == ncbi-geneID, so it seems to me you have KEGG-compatible IDs.

Reading between the lines it seems that your actual question is on how to convert gene symbols to entrezids. If so, please post the code/way you did this now, together with the species you are working with and some example symbols.


Hi Guido,

Thank you for your quick reply. If the IDs are the same, then my problem is that enrichKEGG does not map any of my genes. This is my code:

symbol_genes <- names(deLogFC_up)
entrez_genes <- mapIds(org.Hs.eg.db, symbol_genes, 'ENTREZID', 'SYMBOL')
kegg <- enrichKEGG(gene = entrez_genes, organism = 'hsa', keyType="kegg", pvalueCutoff = 0.05)

And this is the error message:

Reading KEGG annotation online:
No gene can be mapped... Expected input gene ID:
return NULL...
Warning messages:
1: In utils::download.file(url, quiet = TRUE, method = method, ...) :
  the 'wininet' method is deprecated for http:// and https:// URLs
2: In utils::download.file(url, quiet = TRUE, method = method, ...) :
  the 'wininet' method is deprecated for http:// and https:// URLs

Maybe it is important to mention that my gene list is 424 genes long? Could that be an issue?

Thank you. Laia


Please show the output from symbol_genes[1:10] and entrez_genes[1:10], i.e. the first 10 symbols and their corresponding Entrez IDs.

symbol_genes[1:10]

 "TREML3P"        "IL13RA2"        "MMP10"          "LOC124907722"   "AKR1C7P"        "LOC124903970"   "IL11"          
 "LINC02392"      "LOC124903636_1" "LOC124905027"  

entrez_genes[1:10]


    TREML3P        IL13RA2          MMP10   LOC124907722        AKR1C7P   LOC124903970           IL11      LINC02392 
      "340206"         "3598"         "4319"             NA       "648947"             NA         "3589"    "105369893" 
LOC124903636_1   LOC124905027 
            NA             NA

Oops. Could it be that it is not working because the "entrez_genes" vector is named with the gene symbols?


Yep, that is the reason!

This is working (note the use of as.character(), and that in the code below no significance cutoff is applied):

> library(clusterProfiler)
> library(org.Hs.eg.db)
> 
> symbol_genes <- c("TREML3P","IL13RA2","MMP10","LOC124907722","AKR1C7P","LOC124903970","IL11","LINC02392","LOC124903636_1","LOC124905027")  
> 
> entrez_genes <- as.character( mapIds(org.Hs.eg.db, symbol_genes, 'ENTREZID', 'SYMBOL') )
'select()' returned 1:1 mapping between keys and columns
> 
> entrez_genes
 [1] "340206"    "3598"      "4319"      "124907722" "648947"    "124903970"
 [7] "3589"      "105369893" NA          "124905027"
> 
> kegg <- enrichKEGG(gene = entrez_genes, organism = 'hsa', keyType="kegg", pvalueCutoff = 1)
> 
> as.data.frame(kegg)
               ID                            Description GeneRatio  BgRatio
hsa04630 hsa04630             JAK-STAT signaling pathway       2/2 166/8292
hsa04060 hsa04060 Cytokine-cytokine receptor interaction       2/2 295/8292
hsa05323 hsa05323                   Rheumatoid arthritis       1/2  93/8292
hsa04640 hsa04640             Hematopoietic cell lineage       1/2  99/8292
              pvalue    p.adjust qvalue    geneID Count
hsa04630 0.000398406 0.001593624     NA 3598/3589     2
hsa04060 0.001261546 0.002523092     NA 3598/3589     2
hsa05323 0.022306806 0.023737315     NA      3589     1
hsa04640 0.023737315 0.023737315     NA      3589     1
> 
> packageVersion("clusterProfiler")
[1] ‘4.6.2’
> 
> 
> 
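The effect of as.character() used above can be illustrated with base R alone. This is a minimal sketch using a hypothetical named vector (entrez_named) that mimics the mapIds() output posted earlier; the extra clean-up step (dropping NA and duplicates) is optional and an assumption of this sketch, not something enrichKEGG requires. Whether the names by themselves cause the mapping failure may depend on the clusterProfiler version.

```r
# Hypothetical named vector mimicking the mapIds() output shown earlier
entrez_named <- c(TREML3P = "340206", IL13RA2 = "3598", MMP10 = "4319",
                  LOC124907722 = NA, IL11 = "3589")

# as.character() keeps the values but drops the names attribute
entrez_plain <- as.character(entrez_named)

# Optional clean-up: remove NA and duplicated IDs
entrez_clean <- unique(entrez_plain[!is.na(entrez_plain)])

print(names(entrez_plain))
print(entrez_clean)
```

The resulting entrez_clean vector can then be passed as the gene argument of enrichKEGG().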

I keep getting the same NULL error..

entrez_genes <- as.character(mapIds(org.Hs.eg.db, symbol_genes, 'ENTREZID', 'SYMBOL'))

'select()' returned 1:1 mapping between keys and columns

entrez_genes[1:10]

[1] "340206" "3598" "4319" NA "648947" NA "3589" "105369893"
[9] NA NA

kegg <- enrichKEGG(gene = entrez_genes, organism = 'hsa', keyType="kegg", pvalueCutoff = 1)

--> No gene can be mapped....
--> Expected input gene ID:
--> return NULL...

Should I remove all NAs? I tried, but that did not work either.

Also, how come you get fewer NAs than I do?

Yours: [1] "340206" "3598" "4319" "124907722" "648947" "124903970"
[7] "3589" "105369893" NA "124905027"

Mine: [1] "340206" "3598" "4319" NA "648947" NA
[7] "3589" "105369893" NA NA


No, having NA in your vector will work.

I did notice that I retrieved only a single NA (9th position), but you got 4. My Bioconductor installation is up-to-date (v3.16; note that this version is reflected in the version number of org.Hs.eg.db), so this suggests your Bioconductor/package installation is not. Please check, and update if needed.

> ## using your vector of entrez_genes
> entrez_genes <- c("340206", "3598" ,"4319", NA ,"648947", NA ,"3589" ,"105369893", NA, NA)
> 
> entrez_genes
 [1] "340206"    "3598"      "4319"      NA          "648947"    NA         
 [7] "3589"      "105369893" NA          NA         
> 
> ## note that @gene is 7 in length, which shows that the duplicated NA are automagically dropped (6 genes + 1 non-duplicated NA = 7).
> kegg <- enrichKEGG(gene = entrez_genes, organism = 'hsa', keyType="kegg", pvalueCutoff = 1)
> kegg
#
# over-representation test
#
#...@organism    hsa 
#...@ontology    KEGG 
#...@keytype     kegg 
#...@gene        chr [1:7] "340206" "3598" "4319" NA "648947" "3589" "105369893"
#...pvalues adjusted by 'BH' with cutoff <1 
#...4 enriched terms found
'data.frame':   4 obs. of  9 variables:
 $ ID         : chr  "hsa04630" "hsa04060" "hsa05323" "hsa04640"
 $ Description: chr  "JAK-STAT signaling pathway" "Cytokine-cytokine receptor interaction" "Rheumatoid arthritis" "Hematopoietic cell lineage"
 $ GeneRatio  : chr  "2/2" "2/2" "1/2" "1/2"
 $ BgRatio    : chr  "166/8292" "295/8292" "93/8292" "99/8292"
 $ pvalue     : num  0.000398 0.001262 0.022307 0.023737
 $ p.adjust   : num  0.00159 0.00252 0.02374 0.02374
 $ qvalue     : logi  NA NA NA NA
 $ geneID     : chr  "3598/3589" "3598/3589" "3589" "3589"
 $ Count      : int  2 2 1 1
#...Citation
 T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu.
 clusterProfiler 4.0: A universal enrichment tool for interpreting omics data.
 The Innovation. 2021, 2(3):100141 

> 
> packageVersion("clusterProfiler")
[1] ‘4.6.2’
> packageVersion("org.Hs.eg.db")
[1] ‘3.16.0’
> 
> ## for completeness
> sessionInfo()
R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

Random number generation:
 RNG:     Mersenne-Twister 
 Normal:  Inversion 
 Sample:  Rounding 

locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] org.Hs.eg.db_3.16.0   AnnotationDbi_1.60.2  IRanges_2.32.0       
[4] S4Vectors_0.36.2      Biobase_2.58.0        BiocGenerics_0.44.0  
[7] clusterProfiler_4.6.2

loaded via a namespace (and not attached):
 [1] nlme_3.1-162           bitops_1.0-7           ggtree_3.6.2          
 [4] enrichplot_1.18.3      bit64_4.0.5            HDO.db_0.99.1         
 [7] RColorBrewer_1.1-3     httr_1.4.5             GenomeInfoDb_1.34.9   
[10] tools_4.2.2            utf8_1.2.3             R6_2.5.1              
[13] lazyeval_0.2.2         DBI_1.1.3              colorspace_2.1-0      
[16] withr_2.5.0            tidyselect_1.2.0       gridExtra_2.3         
[19] bit_4.0.5              compiler_4.2.2         cli_3.6.0             
[22] scatterpie_0.1.8       shadowtext_0.1.2       scales_1.2.1          
[25] stringr_1.5.0          digest_0.6.31          yulab.utils_0.0.6     
[28] gson_0.1.0             DOSE_3.24.2            XVector_0.38.0        
[31] pkgconfig_2.0.3        fastmap_1.1.1          rlang_1.1.0           
[34] RSQLite_2.3.0          gridGraphics_0.5-1     farver_2.1.1          
[37] generics_0.1.3         jsonlite_1.8.4         BiocParallel_1.32.6   
[40] GOSemSim_2.24.0        dplyr_1.1.0            RCurl_1.98-1.10       
[43] magrittr_2.0.3         ggplotify_0.1.0        GO.db_3.16.0          
[46] GenomeInfoDbData_1.2.9 patchwork_1.1.2        Matrix_1.5-3          
[49] Rcpp_1.0.10            munsell_0.5.0          fansi_1.0.4           
[52] ape_5.7-1              viridis_0.6.2          lifecycle_1.0.3       
[55] stringi_1.7.12         ggraph_2.1.0           MASS_7.3-58.3         
[58] zlibbioc_1.44.0        plyr_1.8.8             qvalue_2.30.0         
[61] grid_4.2.2             blob_1.2.4             parallel_4.2.2        
[64] ggrepel_0.9.3          crayon_1.5.2           lattice_0.20-45       
[67] graphlayouts_0.8.4     Biostrings_2.66.0      cowplot_1.1.1         
[70] splines_4.2.2          KEGGREST_1.38.0        pillar_1.8.1          
[73] fgsea_1.24.0           igraph_1.4.1           reshape2_1.4.4        
[76] codetools_0.2-19       fastmatch_1.1-3        glue_1.6.2            
[79] ggfun_0.0.9            downloader_0.4         data.table_1.14.8     
[82] treeio_1.22.0          png_0.1-8              vctrs_0.6.0           
[85] tweenr_2.0.2           gtable_0.3.3           purrr_1.0.1           
[88] polyclip_1.10-4        tidyr_1.3.0            cachem_1.0.7          
[91] ggplot2_3.4.1          ggforce_0.4.1          tidygraph_1.2.3       
[94] tidytree_0.4.2         viridisLite_0.4.1      tibble_3.2.0          
[97] aplot_0.1.10           memoise_2.0.1         
> 

Wow thanks.

tools:::.BioC_version_associated_with_R_version()

[1] '3.16'

packageVersion("clusterProfiler")

[1] '4.4.4'

packageVersion("org.Hs.eg.db")

[1] '3.15.0'

My clusterProfiler and org.Hs.eg.db packages are outdated compared to yours. But when I run "check for package updates", it says everything is up to date. I'll update them and try again. Thanks!


Do: BiocManager::install(version = "3.16").

The packages are (apparently) up-to-date within Bioconductor version 3.15...
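A rough way to see this mismatch from within R is sketched below. The helper is_older() is hypothetical (not part of any package) and simply compares version strings; the version numbers are the ones reported in this thread. The BiocManager calls are guarded so they only run in an interactive session where BiocManager is installed, since BiocManager::valid() needs internet access.

```r
# Hypothetical helper: TRUE if `installed` is an older version than `minimum`
is_older <- function(installed, minimum) {
  package_version(installed) < package_version(minimum)
}

# Versions reported in this thread (outdated vs. up-to-date installation)
print(is_older("4.4.4", "4.6.2"))    # clusterProfiler
print(is_older("3.15.0", "3.16.0"))  # org.Hs.eg.db

if (interactive() && requireNamespace("BiocManager", quietly = TRUE)) {
  # Bioconductor release this library is currently tied to
  print(BiocManager::version())
  # Lists packages that are out of date or too new for that release
  print(BiocManager::valid())
}
```

If BiocManager::version() reports 3.15 while the R version supports 3.16, packages will look up to date even though newer releases exist.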


I managed to switch to Bioconductor version 3.16 with BiocManager. R said there were 91 packages to be updated, so I updated them. However, some warning messages appeared. One example:

package ‘GOSemSim’ successfully unpacked and MD5 sums checked

Warning: cannot remove prior installation of package ‘GOSemSim’

Warning: restored ‘GOSemSim’

Now, if I check the versions of the two packages (clusterProfiler and org.Hs.eg.db), they are the same as yours. But I cannot load them. I get this error:

> library(clusterProfiler)

Error: package or namespace load failed for ‘clusterProfiler’ in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]): namespace ‘GOSemSim’ 2.22.0 is already loaded, but >= 2.23.1 is required

Is it ok that the downloaded packages are all going to a Temp folder?

The downloaded source packages are in 'C:\Users\u0138385\AppData\Local\Temp\RtmpSYDDre\downloaded_packages’

Other warning messages:

Warning messages:

1: package(s) not installed when version(s) same as or greater than current; use force = TRUE to re-install: 'clusterProfiler'

2: In file.copy(savedcopy, lib, recursive = TRUE) : problem copying C:\Users\u0138385\AppData\Local\R\win-library\4.2\00LOCK\IRanges\libs\x64\IRanges.dll to C:\Users\u0138385\AppData\Local\R\win-library\4.2\IRanges\libs\x64\IRanges.dll: Permission denied

3: In file.copy(savedcopy, lib, recursive = TRUE) : problem copying C:\Users\u0138385\AppData\Local\R\win-library\4.2\00LOCK\Biostrings\libs\x64\Biostrings.dll to C:\Users\u0138385\AppData\Local\R\win-library\4.2\Biostrings\libs\x64\Biostrings.dll: Permission denied


This part is the main issue.

‘GOSemSim’ 2.22.0 is already loaded, but >= 2.23.1 is required

The old GOSemSim was still loaded in the running session, so it could not be replaced. You should restart R and then re-install GOSemSim.
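As a sketch of that procedure, assuming BiocManager is installed: run the following in a freshly restarted R session, before calling library() on anything. The variable gosemsim_loaded is introduced here only for illustration. The force = TRUE argument asks BiocManager::install() to re-install even if the recorded version already looks current, which may be needed after a partially failed update.

```r
# In a fresh session this should be FALSE; if TRUE, restart R first
gosemsim_loaded <- "GOSemSim" %in% loadedNamespaces()
print(gosemsim_loaded)

if (interactive() && !gosemsim_loaded &&
    requireNamespace("BiocManager", quietly = TRUE)) {
  # Re-install the package that failed to be replaced during the update
  BiocManager::install("GOSemSim", force = TRUE)
  # Confirm the installed version afterwards
  print(packageVersion("GOSemSim"))
}
```

The same approach would presumably apply to the other packages that reported "Permission denied" (IRanges, Biostrings), since their DLLs were also in use during the update.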


I could finally solve this and I could get KEGG results :) Thank you so much for your help and quick replies.

Best, Laia

@james-w-macdonald-5106

ncbi-geneID and ENTREZID are the same thing.
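Because of that, the many gene lists mentioned in the question can be converted in R with the same mapIds() call used earlier in the thread, applied over a list. This is only a sketch: gene_lists and symbols_to_entrez() are hypothetical names, the example symbols are taken from the thread, and the conversion runs only if org.Hs.eg.db is installed.

```r
# Hypothetical named list of symbol vectors, one element per gene list
gene_lists <- list(
  up   = c("IL13RA2", "MMP10", "IL11"),
  down = c("TREML3P", "AKR1C7P")
)

# Hypothetical helper: symbols -> unnamed Entrez IDs without NA or duplicates
symbols_to_entrez <- function(symbols, db) {
  ids <- AnnotationDbi::mapIds(db, keys = symbols,
                               column = "ENTREZID", keytype = "SYMBOL")
  ids <- as.character(ids)
  unique(ids[!is.na(ids)])
}

if (requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
  entrez_lists <- lapply(gene_lists, symbols_to_entrez,
                         db = org.Hs.eg.db::org.Hs.eg.db)
  str(entrez_lists)
}
```

Each element of entrez_lists could then be passed to enrichKEGG() in turn, for example with lapply(). clusterProfiler also provides compareCluster(), which may be a convenient alternative for running the same enrichment over a list of gene sets.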
