Question

Why does GSEA on edgeR results for randomized samples give highly significant p-values?

0

Entering edit mode

gene_bioconductor ▴ 10

@gene_bioconductor-22210

Last seen 7 days ago

United States

I am analyzing a set of RNA-Seq data. The sample annotation looks something like this (though with a total of 64 samples):

Sample   Patient  Disease  Treated
1        1        A        No
2        1        A        Yes
3        2        A        No
4        2        A        Yes
5        3        B        No
6        3        B        Yes
7        4        B        No
8        4        B        Yes

I am using edgeR to calculate differential expression to look at the within-patient effects of treatment:

# tx is a tximport object
library(edgeR)
# set up the DGEList object
dgel <- DGEList(counts=tx$counts, samples=sample_info, group=sample_info$Treated, genes=genes_tb)
keep <- filterByExpr(dgel, group=sample_info$Treated, min.count=10, min.total.count=20, large.n=10, min.prop=0.5)
dgel <- dgel[keep,, keep.lib.sizes=FALSE]
dgel <- normLibSizes(dgel, method="RLE")
dgel <- estimateDisp(dgel, model.matrix( ~ Patient + Treated + Disease, data=sample_info), robust=TRUE)
dgel <- equalizeLibSizes(dgel)

# perform the differential expression
fit <- glmQLFit(dgel, model.matrix(~ Patient + Treated + Disease, sample_info), robust=TRUE)
qlf <- glmQLFTest(fit, coef="TreatedYes")
summary(decideTests(qlf))
# no hits
results <- topTags(qlf, p.value=10, n=1000000)$table |> as_tibble()

# do GSEA
stats <- setNames(results$logFC, results$gene_symbol)
# pathways is a list of gene vectors from msigdb
fgsea_res <- fgsea(pathways=pathways, stats=stats, minSize=6, maxSize=300, nproc=8)
nrow(fgsea_res[padj < 0.001 ])
# 137 hits at this stringency
nrow(fgsea_res[padj < 1e-6 ])
# 23 hits at this stringency

Now I do the randomization. I want to randomize the Treatment within pairs, and then randomize Disease across pairs. Then proceed as above.

# randomize the annotation
sample_info_rand <- sample_info_rand |> 
    nest(data=c(Sample, Treated)) |> 
    # randomize `Treated` within `Patients`
    mutate(data = map(data, ~.x  |> mutate(Treated=sample(Treated)))) |> 
    # randomize `Disease` across `Patients`
    mutate(Disease = sample(Disease)) |> 
    unnest(data)

# redo the DGEList setup
dgel_rand <- DGEList(counts=tx$counts, samples=sample_info_rand, group=sample_info_rand$Treated, genes=genes_tb)
keep <- filterByExpr(dgel_rand, group=sample_info_rand$Treated, min.count=10, min.total.count=20, large.n=10, min.prop=0.5)
dgel_rand <- dgel_rand[keep,, keep.lib.sizes=FALSE]
dgel_rand <- normLibSizes(dgel_rand, method="RLE")
dgel_rand <- estimateDisp(dgel_rand, model.matrix( ~ Patient + Treated + Disease, data=sample_info_rand), robust=TRUE)
dgel_rand <- equalizeLibSizes(dgel_rand)

# perform the differential expression
fit <- glmQLFit(dgel_rand, model.matrix(~ Patient + Treated + Disease, sample_info_rand), robust=TRUE)
qlf <- glmQLFTest(fit, coef="TreatedYes")
summary(decideTests(qlf))
# no hits
results_rand <- topTags(qlf, p.value=10, n=1000000)$table |> as_tibble()

# do GSEA
stats <- setNames(results_rand$logFC, results_rand$gene_symbol)
fgsea_res_rand <- fgsea(pathways=pathways, stats=stats, minSize=6, maxSize=300, nproc=8)
nrow(fgsea_res_rand[padj < 0.001 ])
# 1979 hits at this stringency
nrow(fgsea_res_rand[padj < 0.001 ])
# 696 hits at this stringency

If I was just getting a handful of nominally significant hits with the randomized data, then sure, that happens. But I'm getting hits with adjusted p-values of 1e-45. Randomizing Treated per subject ends up with only (on average) half the values swapped, so a good chunk of the original blocking is being carried through, but I would still expect to see some diminution of signal. A full randomization of all metadata across all samples would be a stronger randomization, but I think that would be an unfair comparison since within-Patient differences are smaller than cross-Patient differences.

What are people's thoughts on this?

> sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.4.1

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] org.Hs.eg.db_3.17.0   AnnotationDbi_1.62.2  IRanges_2.34.1        S4Vectors_0.38.2      Biobase_2.60.0        BiocGenerics_0.46.0   enrichplot_1.20.3     clusterProfiler_4.8.2 readxl_1.4.3         
[10] ggrepel_0.9.5         ggbeeswarm_0.7.2      glue_1.7.0            data.table_1.15.4     fgsea_1.26.0          umap_0.2.10.0         msigdbr_7.5.1         tximport_1.28.0       lubridate_1.9.3      
[19] forcats_1.0.0         stringr_1.5.1         dplyr_1.1.4           purrr_1.0.2           readr_2.1.5           tidyr_1.3.1           tibble_3.2.1          ggplot2_3.5.1         tidyverse_2.0.0      
[28] edgeR_3.42.4          limma_3.56.2         

loaded via a namespace (and not attached):
  [1] RColorBrewer_1.1-3      rstudioapi_0.15.0       jsonlite_1.8.8          magrittr_2.0.3          farver_2.1.1            rmarkdown_2.25          fs_1.6.3                zlibbioc_1.46.0        
  [9] vctrs_0.6.5             memoise_2.0.1           RCurl_1.98-1.14         ggtree_3.8.2            askpass_1.2.0           htmltools_0.5.8.1       cellranger_1.1.0        gridGraphics_0.5-1     
 [17] plyr_1.8.9              cachem_1.0.8            conflicted_1.2.0        igraph_2.0.3            lifecycle_1.0.4         pkgconfig_2.0.3         gson_0.1.0              Matrix_1.6-5           
 [25] R6_2.5.1                fastmap_1.1.1           GenomeInfoDbData_1.2.10 aplot_0.2.2             digest_0.6.35           colorspace_2.1-0        patchwork_1.2.0         RSpectra_0.16-1        
 [33] pkgload_1.3.4           RSQLite_2.3.6           labeling_0.4.3          fansi_1.0.6             timechange_0.3.0        httr_1.4.7              polyclip_1.10-6         compiler_4.3.1         
 [41] bit64_4.0.5             withr_3.0.0             downloader_0.4          BiocParallel_1.34.2     viridis_0.6.5           DBI_1.2.2               ggforce_0.4.2           MASS_7.3-60.0.1        
 [49] openssl_2.1.1           HDO.db_0.99.1           tools_4.3.1             vipor_0.4.7             scatterpie_0.2.1        ape_5.7-1               beeswarm_0.4.0          clipr_0.8.0            
 [57] nlme_3.1-164            GOSemSim_2.26.1         shadowtext_0.1.3        grid_4.3.1              reshape2_1.4.4          generics_0.1.3          gtable_0.3.4            tzdb_0.4.0             
 [65] hms_1.1.3               tidygraph_1.3.1         utf8_1.2.4              XVector_0.40.0          pillar_1.9.0            yulab.utils_0.1.4       babelgene_22.9          splines_4.3.1          
 [73] tweenr_2.0.2            treeio_1.24.3           lattice_0.22-6          bit_4.0.5               tidyselect_1.2.1        GO.db_3.17.0            locfit_1.5-9.8          Biostrings_2.68.1      
 [81] knitr_1.45              gridExtra_2.3           xfun_0.43               graphlayouts_1.1.1      statmod_1.5.0           stringi_1.8.3           lazyeval_0.2.2          ggfun_0.1.4            
 [89] yaml_2.3.8              evaluate_0.23           codetools_0.2-19        ggraph_2.1.0            qvalue_2.32.0           BiocManager_1.30.22     ggplotify_0.1.2         cli_3.6.2              
 [97] reticulate_1.36.1       munsell_0.5.0           Rcpp_1.0.12             GenomeInfoDb_1.36.4     png_0.1-8               parallel_4.3.1          blob_1.2.4              DOSE_3.26.1            
[105] bitops_1.0-7            tidytree_0.4.6          viridisLite_0.4.2       scales_1.3.0            crayon_1.5.2            rlang_1.1.3             cowplot_1.1.3           fastmatch_1.1-4        
[113] KEGGREST_1.40.1

fgsea edgeR • 618 views

ADD COMMENT • link 14 days ago • updated 10 days ago gene_bioconductor ▴ 10

score 3 · Answer 1 · 2024-05-02

You have discovered something that I have pointed out many times on this forum. fgsea does not implement the classic GSEA method published by the Broad Institute, which uses sample permutation, but instead implements pre-ranked GSEA, which is an unpublished method using gene permutation. Pre-ranked GSEA make the wildly unrealistic assumption that genes are statistically independent. It does not correct for inter-gene correlation and therefore gives wildy inflated statistical significance. In effect, pre-ranked GSEA is detecting gene sets that contain co-regulated genes rather than gene sets that are differentially expressed between the experimental conditions. The problem becomes more acute for larger datasets with more samples. It isn't a surprise to me that the fgsea p-values remain very small after permutation because sample permutation preserves the inter-gene correlations.

I appreciate the programming contribution of the fgsea author who had made the pre-ranked GSEA method readily available in R, which previously it was not. He has done a service to the community because now it is easy to explore the statistical properties of the method. Pre-ranked GSEA is unfortunately IMO one of the most over-used methods in the bioinformatics and biomedical literature. Unfortunately, people naturally gravitate to the statistical methods that yield the most significant results, even if not statistically valid. See some previous posts on this topic:

Another point I would make is that permutation of RNA-seq samples does not satisfy the assumptions of permutation tests and does not produce a null distribution that is anything like a real RNA-seq data without DE. See for example:

permutation test in edgeR

edgeR provides a analysis pipeline that gives good error rate control. At no stage is there any need for sample permutation. edgeR also provides methods for gene set enrichment analysis that do account for inter-gene corrrelation. See for example ?camera.DGElist or cameraPR.DGELRT. cameraPR is a direct replacement for pre-ranked GSEA but with vastly better error-rate control.