I got some E. coli K12 sequencing results where the mutated genes were labelled with their gene locus ID (f.i. b0014) and the gene name (f.i. dnaK), but without the gene ID. If I want to do GO for instance with clusterProfiler it seems like all the functions only allow a list of gene ID as inputs. Using gseGO() it didn't work with either using gene names or gene locus IDs. An example of what I tried:
express = runif(10,0,100)
namelist=c("dnaK","dnaJ","hokC","mokC","insB1","insA1","rpsT","insL1","yaaW","yaaI")
names(express)=namelist
express = sort(express)
express = rev(express)
library(clusterProfiler)
BiocManager::install("org.EcK12.eg.db")
gse = gseGO(express, ont="MF",keyType="GENENAME",OrgDb="org.EcK12.eg.db")
There should be a bioconductor function which adds the gene ID to a list of gene locus IDs or gene names I assume. Which one is that?
sessionInfo( )
R version 4.2.2 (2022-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.5.1
Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] clusterProfiler_4.6.0 AnnotationHub_3.6.0 BiocFileCache_2.6.0
[4] dbplyr_2.2.1 GOSemSim_2.24.0 GO.db_3.16.0
[7] AnnotationDbi_1.60.0 IRanges_2.32.0 S4Vectors_0.36.0
[10] Biobase_2.58.0 BiocGenerics_0.44.0
loaded via a namespace (and not attached):
[1] fgsea_1.24.0 colorspace_2.0-3
[3] ggtree_3.6.2 gson_0.0.9
[5] ellipsis_0.3.2 qvalue_2.30.0
[7] XVector_0.38.0 aplot_0.1.8
[9] farver_2.1.1 graphlayouts_0.8.3
[11] ggrepel_0.9.2 bit64_4.0.5
[13] scatterpie_0.1.8 interactiveDisplayBase_1.36.0
[15] fansi_1.0.3 codetools_0.2-18
[17] splines_4.2.2 cachem_1.0.6
[19] polyclip_1.10-4 jsonlite_1.8.3
[21] png_0.1-7 ggforce_0.4.1
[23] shiny_1.7.3 BiocManager_1.30.19
[25] compiler_4.2.2 httr_1.4.4
[27] lazyeval_0.2.2 assertthat_0.2.1
[29] Matrix_1.5-3 fastmap_1.1.0
[31] cli_3.4.1 later_1.3.0
[33] tweenr_2.0.2 htmltools_0.5.3
[35] tools_4.2.2 igraph_1.3.5
[37] gtable_0.3.1 glue_1.6.2
[39] GenomeInfoDbData_1.2.9 reshape2_1.4.4
[41] dplyr_1.0.10 rappdirs_0.3.3
[43] fastmatch_1.1-3 Rcpp_1.0.9
[45] enrichplot_1.18.0 vctrs_0.5.1
[47] Biostrings_2.66.0 ape_5.6-2
[49] nlme_3.1-160 ggraph_2.1.0
[51] stringr_1.4.1 mime_0.12
[53] lifecycle_1.0.3 DOSE_3.24.1
[55] org.Hs.eg.db_3.16.0 zlibbioc_1.44.0
[57] MASS_7.3-58.1 scales_1.2.1
[59] tidygraph_1.2.2 promises_1.2.0.1
[61] parallel_4.2.2 RColorBrewer_1.1-3
[63] yaml_2.3.6 curl_4.3.3
[65] memoise_2.0.1 gridExtra_2.3
[67] ggplot2_3.4.0 downloader_0.4
[69] ggfun_0.0.8 HDO.db_0.99.1
[71] yulab.utils_0.0.5 stringi_1.7.8
[73] RSQLite_2.2.18 BiocVersion_3.16.0
[75] tidytree_0.4.1 filelock_1.0.2
[77] BiocParallel_1.32.1 org.EcK12.eg.db_3.16.0
[79] GenomeInfoDb_1.34.3 rlang_1.0.6
[81] pkgconfig_2.0.3 bitops_1.0-7
[83] lattice_0.20-45 purrr_0.3.5
[85] treeio_1.22.0 patchwork_1.1.2
[87] shadowtext_0.1.2 cowplot_1.1.1
[89] bit_4.0.5 tidyselect_1.2.0
[91] plyr_1.8.8 magrittr_2.0.3
[93] R6_2.5.1 generics_0.1.3
[95] DBI_1.1.3 pillar_1.8.1
[97] withr_2.5.0 KEGGREST_1.38.0
[99] RCurl_1.98-1.9 tibble_3.1.8
[101] crayon_1.5.2 utf8_1.2.2
[103] viridis_0.6.2 grid_4.2.2
[105] data.table_1.14.6 blob_1.2.3
[107] digest_0.6.30 xtable_1.8-4
[109] tidyr_1.2.1 httpuv_1.6.6
[111] gridGraphics_0.5-1 munsell_0.5.0
[113] viridisLite_0.4.1 ggplotify_0.1.0
Thank you for the answer, Guido! This worked for me, but I'll have to dig a bit deeper on the p-value aspect. So, am I understanding it correctly, that it found the corresponding GO, but doesn't list the results in gse, because they didn't make it though the p-value constraints? I guess this p-value describes the significance of the fold change, so shouldn't it be independent of how the size of my dataset? And further shouldn't it show all the results with the cutoff set to 1?
It will probably be relevant for me, because I'm actually not going to do an enrichment analysis, but rather compare frequencies of mutations appearing at different loci within a population across multiple samples. So, the fold change in my case is a percentage and the p-Value I can ignore as I understand it.
It is not fully clear to me what exactly you try to achieve, but the function
gseGO
performs a gene set enrichment analysis (GSEA), in which GO categories are used as gene sets. The p-value is indicative of how likely all 'members' of a gene set are enriched on the top (or bottom) of you ranked input list. Since only 10 genes were used as input, no gene set was found to be enriched (which obviously makes sense).