Hi everyone!
I've been playing around a bit with clusterProfiler to work with RNAseq data. When running my own data I noticed that when I ran the same function repeatedly, I would get different results. This happens for both gseKEGG and gseGO, even when I directly copy the commands I used a minute before, different gene sets are claimed to be enriched. I tried the same with the sample data set that they use in examples for this package and saw the same effect and also tried increasing the number of permutations, which didn't really seem to help. I think I'm missing something obvious - could someone help me out?
Below is the code I was using and the output, as well as the output of sessionInfo if that helps any:
data(geneList, package="DOSE")
gseGO(geneList = geneList,
OrgDb = org.Hs.eg.db,
ont = "CC",
nPerm = 1000,
minGSSize = 100,
maxGSSize = 500,
pvalueCutoff = 0.05,
verbose = FALSE)
Output 1
#
# Gene Set Enrichment Analysis
#
#...@organism Homo sapiens
#...@setType CC
#...@keytype ENTREZID
#...@geneList Named num [1:12495] 4.57 4.51 4.42 4.14 3.88 ...
- attr(*, "names")= chr [1:12495] "4312" "8318" "10874" "55143" ...
#...nPerm 1000
#...pvalues adjusted by 'BH' with cutoff <0.05
#...34 enriched terms found
'data.frame': 34 obs. of 11 variables:
$ ID : chr "GO:0031012" "GO:0005578" "GO:0042383" "GO:0000776" ...
$ Description : chr "extracellular matrix" "proteinaceous extracellular matrix" "sarcolemma" "kinetochore" ...
$ setSize : int 399 268 108 103 127 101 121 109 119 139 ...
$ enrichmentScore: num -0.492 -0.557 -0.463 0.675 0.407 ...
$ NES : num -2.15 -2.36 -1.76 2.73 1.73 ...
$ pvalue : num 0.00128 0.00131 0.00146 0.00309 0.0031 ...
$ p.adjust : num 0.0295 0.0295 0.0295 0.0295 0.0295 ...
$ qvalues : num 0.0219 0.0219 0.0219 0.0219 0.0219 ...
$ rank : num 1918 1918 2526 449 1905 ...
$ leading_edge : chr "tags=35%, list=15%, signal=31%" "tags=43%, list=15%, signal=37%" "tags=34%, list=20%, signal=28%" "tags=26%, list=4%, signal=25%" ...
$ core_enrichment: chr "4017/6649/1288/4811/57124/3910/51162/3371/1973/1291/1397/1301/22918/56547/7474/3490/79875/7450/80781/79625/1490"| __truncated__ "4017/1288/4811/57124/3910/3371/1291/1301/56547/7474/79875/7450/80781/79625/1490/1280/1306/4314/80070/8425/977/4"| __truncated__ "5532/3880/6443/5530/6326/10211/825/6548/1291/7402/7779/8082/2318/80024/844/8910/79026/6717/3752/481/1293/358/47"| __truncated__ "1062/10403/55355/220134/4751/79019/55839/54821/4085/81930/81620/332/7272/9212/891/11004/5347/701/11130/79682/57"| __truncated__ ...
Output2
# Gene Set Enrichment Analysis
#
#...@organism Homo sapiens
#...@setType CC
#...@keytype ENTREZID
#...@geneList Named num [1:12495] 4.57 4.51 4.42 4.14 3.88 ...
- attr(*, "names")= chr [1:12495] "4312" "8318" "10874" "55143" ...
#...nPerm 1000
#...pvalues adjusted by 'BH' with cutoff <0.05
#...39 enriched terms found
'data.frame': 39 obs. of 11 variables:
$ ID : chr "GO:0031012" "GO:0005578" "GO:0042383" "GO:0005788" ...
$ Description : chr "extracellular matrix" "proteinaceous extracellular matrix" "sarcolemma" "endoplasmic reticulum lumen" ...
$ setSize : int 399 268 108 166 103 101 109 115 115 119 ...
$ enrichmentScore: num -0.492 -0.557 -0.463 -0.397 0.675 ...
$ NES : num -2.15 -2.35 -1.76 -1.59 2.78 ...
$ pvalue : num 0.00123 0.00132 0.00153 0.00285 0.00286 ...
$ p.adjust : num 0.03 0.03 0.03 0.03 0.03 ...
$ qvalues : num 0.0221 0.0221 0.0221 0.0221 0.0221 ...
$ rank : num 1918 1918 2526 1890 449 ...
$ leading_edge : chr "tags=35%, list=15%, signal=31%" "tags=43%, list=15%, signal=37%" "tags=34%, list=20%, signal=28%" "tags=28%, list=15%, signal=24%" ...
$ core_enrichment: chr "4017/6649/1288/4811/57124/3910/51162/3371/1973/1291/1397/1301/22918/56547/7474/3490/79875/7450/80781/79625/1490"| __truncated__ "4017/1288/4811/57124/3910/3371/1291/1301/56547/7474/79875/7450/80781/79625/1490/1280/1306/4314/80070/8425/977/4"| __truncated__ "5532/3880/6443/5530/6326/10211/825/6548/1291/7402/7779/8082/2318/80024/844/8910/79026/6717/3752/481/1293/358/47"| __truncated__ "64374/845/23621/1291/1301/2098/7474/2322/5624/80781/79642/1280/1306/844/11001/5799/590/3270/1278/80267/414/1277"| __truncated__ ...
SessionInfo
R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 14393)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] clusterProfiler_3.4.4 DOSE_3.2.0 org.Hs.eg.db_3.4.1
[4] AnnotationDbi_1.38.2 IRanges_2.10.2 S4Vectors_0.14.3
[7] Biobase_2.36.2 BiocGenerics_0.22.0 BiocInstaller_1.26.0
loaded via a namespace (and not attached):
[1] Rcpp_0.12.12 compiler_3.4.1 plyr_1.8.4
[4] tools_3.4.1 digest_0.6.12 bit_1.1-12
[7] RSQLite_2.0 memoise_1.1.0 tibble_1.3.4
[10] gtable_0.2.0 pkgconfig_2.0.1 rlang_0.1.2
[13] fastmatch_1.1-0 igraph_1.1.2 DBI_0.7
[16] rvcheck_0.0.9 fgsea_1.2.1 gridExtra_2.2.1
[19] stringr_1.2.0 bit64_0.9-7 grid_3.4.1
[22] glue_1.1.1 qvalue_2.8.0 data.table_1.10.4
[25] BiocParallel_1.10.1 GOSemSim_2.2.0 purrr_0.2.3
[28] tidyr_0.7.0 GO.db_3.4.1 DO.db_2.9
[31] ggplot2_2.2.1 reshape2_1.4.2 blob_1.1.0
[34] magrittr_1.5 splines_3.4.1 scales_0.5.0
[37] colorspace_1.3-2 stringi_1.1.5 lazyeval_0.2.0
[40] munsell_0.4.3
Thanks for your comment.
However, it does not explain why two runs of the exact same calculation will return different results even if they are 'similar'. Could you give me some insight into how this is possible and how to circumvent this issue?
I am also having the same issue where I get different numbers of enriched terms when I re-run the exact same analysis. Were you able to figure out why it does this?
No, unfortunately not. I didn't end up using the package very much after all. I'm not sure actually where it comes from. Maybe there's a seed somewhere that you can set? I'm not very familiar with the details anymore, I'm afraid...
I have the same issue.