Question

KEGG enrichment in R and gene IDs

Laia ▴ 10

@239caad3

Last seen 9 months ago

Belgium

Hi,

I am trying to run a KEGG enrichment analysis on my data. My genes are in SYMBOL, which I converted to ENTREZID, but I need them in "kegg" or "ncbi-geneID" to run enrichKEGG. I looked for packages that would convert these, but they are not updated: MS_convertGene, keggConv, KEGGREST, topGO.

What worked was using the website https://www.genome.jp/kegg/tool/conv_id.html, but I have many gene lists to convert.

Is there any tool that I could use in R to convert these IDs in a more handy way?

Thank you. Laia

KEGG enrichKEGG R entrezid • 6.0k views

ADD COMMENT • link 16 months ago Laia ▴ 10

score 0 · Answer 1 · 2023-03-23

Guido Hooiveld ★ 4.0k

@guido-hooiveld-2020

Last seen 11 hours ago

Wageningen University, Wageningen, the …

Please note that ENTREZID == ncbi-geneID, so it seems to me you have KEGG-compatible IDs.

Reading between the lines it seems that your actual question is on how to convert gene symbols to entrezids. If so, please post the code/way you did this now, together with the species you are working with and some example symbols.
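For the symbol-to-Entrez step, and for the case of many gene lists, a minimal sketch could look like the one below. It assumes human data and that org.Hs.eg.db is installed. The helper name symbols_to_entrez and the two example lists are hypothetical; only the symbols themselves appear in this thread.

```r
# Convert gene symbols to Entrez Gene IDs (the same IDs KEGG calls ncbi-geneid).
# Assumption: human data, annotated with org.Hs.eg.db.
symbols_to_entrez <- function(symbols) {
  ids <- AnnotationDbi::mapIds(org.Hs.eg.db::org.Hs.eg.db,
                               keys = symbols,
                               column = "ENTREZID",
                               keytype = "SYMBOL")
  # Return a plain (unnamed) character vector; unmapped symbols stay NA.
  as.character(ids)
}

# Hypothetical gene lists; replace with your own.
gene_lists <- list(
  list_a = c("IL13RA2", "MMP10", "IL11"),
  list_b = c("TREML3P", "AKR1C7P")
)

# Only run the conversion when the annotation package is available.
if (requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
  entrez_lists <- lapply(gene_lists, symbols_to_entrez)
  str(entrez_lists)
}
```

Each element of entrez_lists can then be passed as the gene argument of enrichKEGG(), one list at a time.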

ADD COMMENT • link 16 months ago Guido Hooiveld ★ 4.0k

Hi Guido,

Thank you for your quick reply. If they are the same, then I don't understand why I don't get any mapping. This is my code:

symbol_genes <- names(deLogFC_up)
entrez_genes <- mapIds(org.Hs.eg.db, symbol_genes, 'ENTREZID', 'SYMBOL')
kegg <- enrichKEGG(gene = entrez_genes, organism = 'hsa', keyType="kegg", pvalueCutoff = 0.05)

And this is the error message:

Reading KEGG annotation online:
No gene can be mapped... Expected input gene ID:
return NULL...
Warning messages:
1: In utils::download.file(url, quiet = TRUE, method = method, ...) :
  the 'wininet' method is deprecated for http:// and https:// URLs
2: In utils::download.file(url, quiet = TRUE, method = method, ...) :
  the 'wininet' method is deprecated for http:// and https:// URLs

Maybe it is important to mention that my gene list is 424 genes long? Could that be an issue?

Thank you. Laia

ADD REPLY • link 16 months ago Laia ▴ 10

Please show the output from symbol_genes[1:10] and entrez_genes[1:10]; thus (for example) the first 10 symbols resp. entrezids.

ADD REPLY • link 16 months ago Guido Hooiveld ★ 4.0k

symbol_genes[1:10]

 "TREML3P"        "IL13RA2"        "MMP10"          "LOC124907722"   "AKR1C7P"        "LOC124903970"   "IL11"          
 "LINC02392"      "LOC124903636_1" "LOC124905027"  

entrez_genes[1:10]


    TREML3P        IL13RA2          MMP10   LOC124907722        AKR1C7P   LOC124903970           IL11      LINC02392 
      "340206"         "3598"         "4319"             NA       "648947"             NA         "3589"    "105369893" 
LOC124903636_1   LOC124905027 
            NA             NA

Oops. Could it be that it is not working because the "entrez_genes" vector is named with the gene symbols?

ADD REPLY • link 16 months ago Laia ▴ 10

Yep, that is the reason!

This is working (note the use of as.character(), and that in the code below no significance cutoff is applied):

> library(clusterProfiler)
> library(org.Hs.eg.db)
> 
> symbol_genes <- c("TREML3P","IL13RA2","MMP10","LOC124907722","AKR1C7P","LOC124903970","IL11","LINC02392","LOC124903636_1","LOC124905027")  
> 
> entrez_genes <- as.character( mapIds(org.Hs.eg.db, symbol_genes, 'ENTREZID', 'SYMBOL') )
'select()' returned 1:1 mapping between keys and columns
> 
> entrez_genes
 [1] "340206"    "3598"      "4319"      "124907722" "648947"    "124903970"
 [7] "3589"      "105369893" NA          "124905027"
> 
> kegg <- enrichKEGG(gene = entrez_genes, organism = 'hsa', keyType="kegg", pvalueCutoff = 1)
> 
> as.data.frame(kegg)
               ID                            Description GeneRatio  BgRatio
hsa04630 hsa04630             JAK-STAT signaling pathway       2/2 166/8292
hsa04060 hsa04060 Cytokine-cytokine receptor interaction       2/2 295/8292
hsa05323 hsa05323                   Rheumatoid arthritis       1/2  93/8292
hsa04640 hsa04640             Hematopoietic cell lineage       1/2  99/8292
              pvalue    p.adjust qvalue    geneID Count
hsa04630 0.000398406 0.001593624     NA 3598/3589     2
hsa04060 0.001261546 0.002523092     NA 3598/3589     2
hsa05323 0.022306806 0.023737315     NA      3589     1
hsa04640 0.023737315 0.023737315     NA      3589     1
> 
> packageVersion("clusterProfiler")
[1] ‘4.6.2’
> 
> 
>
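To make the effect of as.character() concrete, here is a small base R sketch. The named vector is a hypothetical stand-in for mapIds() output, built from values shown in this thread. Dropping NA values is optional and shown only as a tidy-up step.

```r
# Hypothetical named vector with the same shape as mapIds() output:
# names are the gene symbols, values are Entrez IDs (NA when unmapped).
entrez_named <- c(TREML3P = "340206", IL13RA2 = "3598", MMP10 = "4319",
                  LOC124907722 = NA, IL11 = "3589")

# as.character() keeps values and order but drops the names attribute.
entrez_plain <- as.character(entrez_named)
print(names(entrez_plain))

# Optional tidy-up: remove NA entries and duplicates before enrichment.
entrez_clean <- unique(entrez_plain[!is.na(entrez_plain)])
print(entrez_clean)
```

unname(entrez_named) would also drop the names; as.character() is used here to match the code above.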

ADD REPLY • link 16 months ago Guido Hooiveld ★ 4.0k

I keep getting the same NULL error..

entrez_genes <- as.character(mapIds(org.Hs.eg.db, symbol_genes, 'ENTREZID', 'SYMBOL'))

'select()' returned 1:1 mapping between keys and columns

entrez_genes[1:10]

 [1] "340206"    "3598"      "4319"      NA          "648947"    NA          "3589"      "105369893"
 [9] NA          NA

kegg <- enrichKEGG(gene = entrez_genes, organism = 'hsa', keyType="kegg", pvalueCutoff = 1)

--> No gene can be mapped....
--> Expected input gene ID:
--> return NULL...

Should I remove all NAs? -- No, that did not work either.

Also, how come you get fewer NAs than I do?

Yours:
[1] "340206" "3598" "4319" "124907722" "648947" "124903970"
[7] "3589" "105369893" NA "124905027"

Mine:
[1] "340206" "3598" "4319" NA "648947" NA
[7] "3589" "105369893" NA NA

ADD REPLY • link 16 months ago Laia ▴ 10

No, having NA in your vector will work.

I did notice that I retrieved only a single NA (9th position), but you got 4. My Bioconductor installation is up to date (v3.16; note that this version is reflected in that of org.Hs.eg.db), which suggests that your Bioconductor/package installation is not. Please check, and update if needed.

> ## using your vector of entrez_genes
> entrez_genes <- c("340206", "3598" ,"4319", NA ,"648947", NA ,"3589" ,"105369893", NA, NA)
> 
> entrez_genes
 [1] "340206"    "3598"      "4319"      NA          "648947"    NA         
 [7] "3589"      "105369893" NA          NA         
> 
> ## note that @gene is 7 in length, which shows that the NA are automagically ignored (6 genes + 1 not-duplicated NA = 7).
> kegg <- enrichKEGG(gene = entrez_genes, organism = 'hsa', keyType="kegg", pvalueCutoff = 1)
> kegg
#
# over-representation test
#
#...@organism    hsa 
#...@ontology    KEGG 
#...@keytype     kegg 
#...@gene        chr [1:7] "340206" "3598" "4319" NA "648947" "3589" "105369893"
#...pvalues adjusted by 'BH' with cutoff <1 
#...4 enriched terms found
'data.frame':   4 obs. of  9 variables:
 $ ID         : chr  "hsa04630" "hsa04060" "hsa05323" "hsa04640"
 $ Description: chr  "JAK-STAT signaling pathway" "Cytokine-cytokine receptor interaction" "Rheumatoid arthritis" "Hematopoietic cell lineage"
 $ GeneRatio  : chr  "2/2" "2/2" "1/2" "1/2"
 $ BgRatio    : chr  "166/8292" "295/8292" "93/8292" "99/8292"
 $ pvalue     : num  0.000398 0.001262 0.022307 0.023737
 $ p.adjust   : num  0.00159 0.00252 0.02374 0.02374
 $ qvalue     : logi  NA NA NA NA
 $ geneID     : chr  "3598/3589" "3598/3589" "3589" "3589"
 $ Count      : int  2 2 1 1
#...Citation
 T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu.
 clusterProfiler 4.0: A universal enrichment tool for interpreting omics data.
 The Innovation. 2021, 2(3):100141 

> 
> packageVersion("clusterProfiler")
[1] ‘4.6.2’
> packageVersion("org.Hs.eg.db")
[1] ‘3.16.0’
> 
> ## for completeness
> sessionInfo()
R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

Random number generation:
 RNG:     Mersenne-Twister 
 Normal:  Inversion 
 Sample:  Rounding 

locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] org.Hs.eg.db_3.16.0   AnnotationDbi_1.60.2  IRanges_2.32.0       
[4] S4Vectors_0.36.2      Biobase_2.58.0        BiocGenerics_0.44.0  
[7] clusterProfiler_4.6.2

loaded via a namespace (and not attached):
 [1] nlme_3.1-162           bitops_1.0-7           ggtree_3.6.2          
 [4] enrichplot_1.18.3      bit64_4.0.5            HDO.db_0.99.1         
 [7] RColorBrewer_1.1-3     httr_1.4.5             GenomeInfoDb_1.34.9   
[10] tools_4.2.2            utf8_1.2.3             R6_2.5.1              
[13] lazyeval_0.2.2         DBI_1.1.3              colorspace_2.1-0      
[16] withr_2.5.0            tidyselect_1.2.0       gridExtra_2.3         
[19] bit_4.0.5              compiler_4.2.2         cli_3.6.0             
[22] scatterpie_0.1.8       shadowtext_0.1.2       scales_1.2.1          
[25] stringr_1.5.0          digest_0.6.31          yulab.utils_0.0.6     
[28] gson_0.1.0             DOSE_3.24.2            XVector_0.38.0        
[31] pkgconfig_2.0.3        fastmap_1.1.1          rlang_1.1.0           
[34] RSQLite_2.3.0          gridGraphics_0.5-1     farver_2.1.1          
[37] generics_0.1.3         jsonlite_1.8.4         BiocParallel_1.32.6   
[40] GOSemSim_2.24.0        dplyr_1.1.0            RCurl_1.98-1.10       
[43] magrittr_2.0.3         ggplotify_0.1.0        GO.db_3.16.0          
[46] GenomeInfoDbData_1.2.9 patchwork_1.1.2        Matrix_1.5-3          
[49] Rcpp_1.0.10            munsell_0.5.0          fansi_1.0.4           
[52] ape_5.7-1              viridis_0.6.2          lifecycle_1.0.3       
[55] stringi_1.7.12         ggraph_2.1.0           MASS_7.3-58.3         
[58] zlibbioc_1.44.0        plyr_1.8.8             qvalue_2.30.0         
[61] grid_4.2.2             blob_1.2.4             parallel_4.2.2        
[64] ggrepel_0.9.3          crayon_1.5.2           lattice_0.20-45       
[67] graphlayouts_0.8.4     Biostrings_2.66.0      cowplot_1.1.1         
[70] splines_4.2.2          KEGGREST_1.38.0        pillar_1.8.1          
[73] fgsea_1.24.0           igraph_1.4.1           reshape2_1.4.4        
[76] codetools_0.2-19       fastmatch_1.1-3        glue_1.6.2            
[79] ggfun_0.0.9            downloader_0.4         data.table_1.14.8     
[82] treeio_1.22.0          png_0.1-8              vctrs_0.6.0           
[85] tweenr_2.0.2           gtable_0.3.3           purrr_1.0.1           
[88] polyclip_1.10-4        tidyr_1.3.0            cachem_1.0.7          
[91] ggplot2_3.4.1          ggforce_0.4.1          tidygraph_1.2.3       
[94] tidytree_0.4.2         viridisLite_0.4.1      tibble_3.2.0          
[97] aplot_0.1.10           memoise_2.0.1         
>

ADD REPLY • link 16 months ago Guido Hooiveld ★ 4.0k

Wow thanks.

tools:::.BioC_version_associated_with_R_version()

[1] '3.16'

packageVersion("clusterProfiler")

[1] '4.4.4'

packageVersion("org.Hs.eg.db")

[1] '3.15.0'

My clusterProfiler and org.Hs packages are the ones outdated, compared to yours. But when I run "check for package updates" it says it is up to date.... I'll update them and try again. Thanks!
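As a sketch of how such a version comparison can be done programmatically, the base R snippet below compares the version numbers reported in this thread. The helper name is_older is hypothetical. In a live session, packageVersion("clusterProfiler") would supply the installed version instead of a hard-coded string.

```r
# Versions reported in this thread: my installation vs. the working setup.
installed <- c(clusterProfiler = "4.4.4", org.Hs.eg.db = "3.15.0")
reference <- c(clusterProfiler = "4.6.2", org.Hs.eg.db = "3.16.0")

# package_version() compares component by component, so "4.10.0" is treated
# as newer than "4.9.0" (a plain string comparison would get this wrong).
is_older <- function(have, want) {
  package_version(have) < package_version(want)
}

# Named logical vector: TRUE where the installed version is behind.
outdated <- mapply(is_older, installed, reference[names(installed)])
print(outdated)
```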

ADD REPLY • link 16 months ago Laia ▴ 10

Do: BiocManager::install(version = "3.16").

The packages are (apparently) up-to-date within Bioconductor version 3.15...
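To see which Bioconductor release an installation is on, and whether the installed packages match it, something along these lines may help. This is only a sketch: it assumes BiocManager is installed, the wrapper name check_bioc is hypothetical, and valid() needs internet access.

```r
# Report the Bioconductor release in use and check package consistency.
check_bioc <- function() {
  # Release that BiocManager is currently configured for (e.g. 3.15 or 3.16).
  message("Bioconductor version: ", as.character(BiocManager::version()))
  # Returns TRUE, or a list describing out-of-date and too-new packages.
  BiocManager::valid()
}

# Run only in an interactive session where BiocManager is available.
if (interactive() && requireNamespace("BiocManager", quietly = TRUE)) {
  print(check_bioc())
}
```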

ADD REPLY • link 16 months ago Guido Hooiveld ★ 4.0k

I managed to install Bioconductor version 3.16 with BiocManager. R said there were 91 packages to be updated, so I updated them. However, some warning messages appeared. One example:

package ‘GOSemSim’ successfully unpacked and MD5 sums checked

Warning: cannot remove prior installation of package ‘GOSemSim’

Warning: restored ‘GOSemSim’

Now, if I check the versions of the two packages (clusterProfiler and org.Hs.eg.db), they are the same as yours. But I cannot load them with library(). I get this error:

> library(clusterProfiler)

Error: package or namespace load failed for ‘clusterProfiler’ in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]): namespace ‘GOSemSim’ 2.22.0 is already loaded, but >= 2.23.1 is required

Is it ok that the downloaded packages are all going to a Temp folder?

The downloaded source packages are in 'C:\Users\u0138385\AppData\Local\Temp\RtmpSYDDre\downloaded_packages’

Other warning messages:

Warning messages:

1: package(s) not installed when version(s) same as or greater than current; use force = TRUE to re-install: 'clusterProfiler'

2: In file.copy(savedcopy, lib, recursive = TRUE) : problem copying C:\Users\u0138385\AppData\Local\R\win-library\4.2\00LOCK\IRanges\libs\x64\IRanges.dll to C:\Users\u0138385\AppData\Local\R\win-library\4.2\IRanges\libs\x64\IRanges.dll: Permission denied

3: In file.copy(savedcopy, lib, recursive = TRUE) : problem copying C:\Users\u0138385\AppData\Local\R\win-library\4.2\00LOCK\Biostrings\libs\x64\Biostrings.dll to C:\Users\u0138385\AppData\Local\R\win-library\4.2\Biostrings\libs\x64\Biostrings.dll: Permission denied

ADD REPLY • link 16 months ago Laia ▴ 10

This part is the main issue.

‘GOSemSim’ 2.22.0 is already loaded, but >= 2.23.1 is required

You should restart R and then re-install GOSemSim.
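A rough sketch of that sequence follows, to be run in a fresh R session. The helper name safe_to_reinstall is hypothetical, and the install call is left commented out so that nothing is installed by accident. On Windows, a package that is loaded in any open R session is likely the cause of the "cannot remove prior installation" and "Permission denied" warnings above.

```r
# TRUE when the package's namespace is not loaded in the current session.
safe_to_reinstall <- function(pkg) {
  !(pkg %in% loadedNamespaces())
}

if (!safe_to_reinstall("GOSemSim")) {
  message("GOSemSim is loaded; restart R before re-installing.")
} else if (requireNamespace("BiocManager", quietly = TRUE)) {
  message("GOSemSim is not loaded in this session.")
  # force = TRUE re-installs even when the recorded version looks current.
  # Uncomment to perform the installation:
  # BiocManager::install("GOSemSim", force = TRUE)
}
```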

ADD REPLY • link 16 months ago James W. MacDonald 66k

I could finally solve this and I could get KEGG results :) Thank you so much for your help and quick replies.

Best, Laia

ADD REPLY • link 16 months ago Laia ▴ 10

score 0 · Answer 2 · 2023-03-23

James W. MacDonald 66k

@james-w-macdonald-5106

Last seen 5 hours ago

United States

ncbi-geneID and ENTREZID are the same thing.

ADD COMMENT • link 16 months ago James W. MacDonald 66k