Adding geneID to list of gene locus IDs
Entering edit mode
Chruut • 0
Last seen 11 days ago

I got some E. coli K12 sequencing results where the mutated genes were labelled with their gene locus ID (f.i. b0014) and the gene name (f.i. dnaK), but without the gene ID. If I want to do GO for instance with clusterProfiler it seems like all the functions only allow a list of gene ID as inputs. Using gseGO() it didn't work with either using gene names or gene locus IDs. An example of what I tried:

express = runif(10,0,100)
express = sort(express)
express = rev(express)
gse = gseGO(express, ont="MF",keyType="GENENAME",OrgDb="")

There should be a bioconductor function which adds the gene ID to a list of gene locus IDs or gene names I assume. Which one is that?

sessionInfo( )

R version 4.2.2 (2022-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.5.1

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib

[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] clusterProfiler_4.6.0 AnnotationHub_3.6.0   BiocFileCache_2.6.0  
 [4] dbplyr_2.2.1          GOSemSim_2.24.0       GO.db_3.16.0         
 [7] AnnotationDbi_1.60.0  IRanges_2.32.0        S4Vectors_0.36.0     
[10] Biobase_2.58.0        BiocGenerics_0.44.0  

loaded via a namespace (and not attached):
  [1] fgsea_1.24.0                  colorspace_2.0-3             
  [3] ggtree_3.6.2                  gson_0.0.9                   
  [5] ellipsis_0.3.2                qvalue_2.30.0                
  [7] XVector_0.38.0                aplot_0.1.8                  
  [9] farver_2.1.1                  graphlayouts_0.8.3           
 [11] ggrepel_0.9.2                 bit64_4.0.5                  
 [13] scatterpie_0.1.8              interactiveDisplayBase_1.36.0
 [15] fansi_1.0.3                   codetools_0.2-18             
 [17] splines_4.2.2                 cachem_1.0.6                 
 [19] polyclip_1.10-4               jsonlite_1.8.3               
 [21] png_0.1-7                     ggforce_0.4.1                
 [23] shiny_1.7.3                   BiocManager_1.30.19          
 [25] compiler_4.2.2                httr_1.4.4                   
 [27] lazyeval_0.2.2                assertthat_0.2.1             
 [29] Matrix_1.5-3                  fastmap_1.1.0                
 [31] cli_3.4.1                     later_1.3.0                  
 [33] tweenr_2.0.2                  htmltools_0.5.3              
 [35] tools_4.2.2                   igraph_1.3.5                 
 [37] gtable_0.3.1                  glue_1.6.2                   
 [39] GenomeInfoDbData_1.2.9        reshape2_1.4.4               
 [41] dplyr_1.0.10                  rappdirs_0.3.3               
 [43] fastmatch_1.1-3               Rcpp_1.0.9                   
 [45] enrichplot_1.18.0             vctrs_0.5.1                  
 [47] Biostrings_2.66.0             ape_5.6-2                    
 [49] nlme_3.1-160                  ggraph_2.1.0                 
 [51] stringr_1.4.1                 mime_0.12                    
 [53] lifecycle_1.0.3               DOSE_3.24.1                  
 [55]           zlibbioc_1.44.0              
 [57] MASS_7.3-58.1                 scales_1.2.1                 
 [59] tidygraph_1.2.2               promises_1.2.0.1             
 [61] parallel_4.2.2                RColorBrewer_1.1-3           
 [63] yaml_2.3.6                    curl_4.3.3                   
 [65] memoise_2.0.1                 gridExtra_2.3                
 [67] ggplot2_3.4.0                 downloader_0.4               
 [69] ggfun_0.0.8                   HDO.db_0.99.1                
 [71] yulab.utils_0.0.5             stringi_1.7.8                
 [73] RSQLite_2.2.18                BiocVersion_3.16.0           
 [75] tidytree_0.4.1                filelock_1.0.2               
 [77] BiocParallel_1.32.1        
 [79] GenomeInfoDb_1.34.3           rlang_1.0.6                  
 [81] pkgconfig_2.0.3               bitops_1.0-7                 
 [83] lattice_0.20-45               purrr_0.3.5                  
 [85] treeio_1.22.0                 patchwork_1.1.2              
 [87] shadowtext_0.1.2              cowplot_1.1.1                
 [89] bit_4.0.5                     tidyselect_1.2.0             
 [91] plyr_1.8.8                    magrittr_2.0.3               
 [93] R6_2.5.1                      generics_0.1.3               
 [95] DBI_1.1.3                     pillar_1.8.1                 
 [97] withr_2.5.0                   KEGGREST_1.38.0              
 [99] RCurl_1.98-1.9                tibble_3.1.8                 
[101] crayon_1.5.2                  utf8_1.2.2                   
[103] viridis_0.6.2                 grid_4.2.2                   
[105] data.table_1.14.6             blob_1.2.3                   
[107] digest_0.6.30                 xtable_1.8-4                 
[109] tidyr_1.2.1                   httpuv_1.6.6                 
[111] gridGraphics_0.5-1            munsell_0.5.0                
[113] viridisLite_0.4.1             ggplotify_0.1.0
cluster OrgDb clusterProfiler • 91 views
Entering edit mode
Guido Hooiveld ★ 3.4k
Last seen 1 day ago
Wageningen University, Wageningen, the …

When calling the function gseGO you should set the argument keyType="SYMBOL" (and not to the GENENAME).

(You can ignore the warnings because a very limited data set was analyzed, but it shows gseGO is working fine now).

> library(
> k <- keys(
> # check the content of the annotation database; which column to use
> AnnotationDbi::select(, keys=k[1:5], columns = c('ENTREZID', 'SYMBOL', 'GENENAME'), keytype = "ENTREZID")
'select()' returned 1:1 mapping between keys and columns
1   944740   yjhR
2   944741   nfrA
3   944742   thrL
4   944743  insB1
5   944744   sspA
1                                                         pseudo
2 exopolysaccharide secretion system outer membrane protein NfrA
3                                      thr operon leader peptide
4                                        IS1 family protein InsB
5                                 stringent starvation protein A
> gse = gseGO(express, ont="MF",keyType="SYMBOL",OrgDb="", pvalueCutoff = 1)
preparing geneSet collections...
GSEA analysis...
no term enriched under specific pvalueCutoff...
Warning message:
In preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam,  :
  All values in the stats vector are greater than zero and scoreType is "std", maybe you should switch to scoreType = "pos".
> gse
# Gene Set Enrichment Analysis
#...@organism    Escherichia coli 
#...@setType     MF 
#...@keytype     SYMBOL 
#...@geneList    Named num [1:10] 94.6 86.8 73.5 66.7 54.3 ...
 - attr(*, "names")= chr [1:10] "yaaI" "hokC" "dnaK" "insA1" ...
#...pvalues adjusted by 'BH' with cutoff <1 
#...0 enriched terms found
'data.frame':   0 obs. of  8 variables:
 $ ID             : chr 
 $ Description    : chr 
 $ setSize        : int 
 $ enrichmentScore: num 
 $ NES            : num 
 $ pvalue         : num 
 $ p.adjust       : num 
 $ qvalue         : num 
 T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu.
 clusterProfiler 4.0: A universal enrichment tool for interpreting omics data.
 The Innovation. 2021, 2(3):100141 

Entering edit mode

Thank you for the answer, Guido! This worked for me, but I'll have to dig a bit deeper on the p-value aspect. So, am I understanding it correctly, that it found the corresponding GO, but doesn't list the results in gse, because they didn't make it though the p-value constraints? I guess this p-value describes the significance of the fold change, so shouldn't it be independent of how the size of my dataset? And further shouldn't it show all the results with the cutoff set to 1?

It will probably be relevant for me, because I'm actually not going to do an enrichment analysis, but rather compare frequencies of mutations appearing at different loci within a population across multiple samples. So, the fold change in my case is a percentage and the p-Value I can ignore as I understand it.

Entering edit mode

It is not fully clear to me what exactly you try to achieve, but the function gseGO performs a gene set enrichment analysis (GSEA), in which GO categories are used as gene sets. The p-value is indicative of how likely all 'members' of a gene set are enriched on the top (or bottom) of you ranked input list. Since only 10 genes were used as input, no gene set was found to be enriched (which obviously makes sense).


Login before adding your answer.

Traffic: 533 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6