I am analyzing a set of RNA-Seq data. The sample annotation looks something like this (though with a total of 64 samples):
Sample Patient Disease Treated
1 1 A No
2 1 A Yes
3 2 A No
4 2 A Yes
5 3 B No
6 3 B Yes
7 4 B No
8 4 B Yes
I am using edgeR
to calculate differential expression to look at the within-patient effects of treatment:
# tx is a tximport object
library(edgeR)
# set up the DGEList object
dgel <- DGEList(counts=tx$counts, samples=sample_info, group=sample_info$Treated, genes=genes_tb)
keep <- filterByExpr(dgel, group=sample_info$Treated, min.count=10, min.total.count=20, large.n=10, min.prop=0.5)
dgel <- dgel[keep,, keep.lib.sizes=FALSE]
dgel <- normLibSizes(dgel, method="RLE")
dgel <- estimateDisp(dgel, model.matrix( ~ Patient + Treated + Disease, data=sample_info), robust=TRUE)
dgel <- equalizeLibSizes(dgel)
# perform the differential expression
fit <- glmQLFit(dgel, model.matrix(~ Patient + Treated + Disease, sample_info), robust=TRUE)
qlf <- glmQLFTest(fit, coef="TreatedYes")
summary(decideTests(qlf))
# no hits
results <- topTags(qlf, p.value=10, n=1000000)$table |> as_tibble()
# do GSEA
stats <- setNames(results$logFC, results$gene_symbol)
# pathways is a list of gene vectors from msigdb
fgsea_res <- fgsea(pathways=pathways, stats=stats, minSize=6, maxSize=300, nproc=8)
nrow(fgsea_res[padj < 0.001 ])
# 137 hits at this stringency
nrow(fgsea_res[padj < 1e-6 ])
# 23 hits at this stringency
Now I do the randomization. I want to randomize the Treatment within pairs, and then randomize Disease across pairs. Then proceed as above.
# randomize the annotation
sample_info_rand <- sample_info_rand |>
nest(data=c(Sample, Treated)) |>
# randomize `Treated` within `Patients`
mutate(data = map(data, ~.x |> mutate(Treated=sample(Treated)))) |>
# randomize `Disease` across `Patients`
mutate(Disease = sample(Disease)) |>
unnest(data)
# redo the DGEList setup
dgel_rand <- DGEList(counts=tx$counts, samples=sample_info_rand, group=sample_info_rand$Treated, genes=genes_tb)
keep <- filterByExpr(dgel_rand, group=sample_info_rand$Treated, min.count=10, min.total.count=20, large.n=10, min.prop=0.5)
dgel_rand <- dgel_rand[keep,, keep.lib.sizes=FALSE]
dgel_rand <- normLibSizes(dgel_rand, method="RLE")
dgel_rand <- estimateDisp(dgel_rand, model.matrix( ~ Patient + Treated + Disease, data=sample_info_rand), robust=TRUE)
dgel_rand <- equalizeLibSizes(dgel_rand)
# perform the differential expression
fit <- glmQLFit(dgel_rand, model.matrix(~ Patient + Treated + Disease, sample_info_rand), robust=TRUE)
qlf <- glmQLFTest(fit, coef="TreatedYes")
summary(decideTests(qlf))
# no hits
results_rand <- topTags(qlf, p.value=10, n=1000000)$table |> as_tibble()
# do GSEA
stats <- setNames(results_rand$logFC, results_rand$gene_symbol)
fgsea_res_rand <- fgsea(pathways=pathways, stats=stats, minSize=6, maxSize=300, nproc=8)
nrow(fgsea_res_rand[padj < 0.001 ])
# 1979 hits at this stringency
nrow(fgsea_res_rand[padj < 0.001 ])
# 696 hits at this stringency
If I was just getting a handful of nominally significant hits with the randomized data, then sure, that happens. But I'm getting hits with adjusted p-values of 1e-45. Randomizing Treated
per subject ends up with only (on average) half the values swapped, so a good chunk of the original blocking is being carried through, but I would still expect to see some diminution of signal. A full randomization of all metadata across all samples would be a stronger randomization, but I think that would be an unfair comparison since within-Patient differences are smaller than cross-Patient differences.
What are people's thoughts on this?
> sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.4.1
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/Los_Angeles
tzcode source: internal
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] org.Hs.eg.db_3.17.0 AnnotationDbi_1.62.2 IRanges_2.34.1 S4Vectors_0.38.2 Biobase_2.60.0 BiocGenerics_0.46.0 enrichplot_1.20.3 clusterProfiler_4.8.2 readxl_1.4.3
[10] ggrepel_0.9.5 ggbeeswarm_0.7.2 glue_1.7.0 data.table_1.15.4 fgsea_1.26.0 umap_0.2.10.0 msigdbr_7.5.1 tximport_1.28.0 lubridate_1.9.3
[19] forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4 purrr_1.0.2 readr_2.1.5 tidyr_1.3.1 tibble_3.2.1 ggplot2_3.5.1 tidyverse_2.0.0
[28] edgeR_3.42.4 limma_3.56.2
loaded via a namespace (and not attached):
[1] RColorBrewer_1.1-3 rstudioapi_0.15.0 jsonlite_1.8.8 magrittr_2.0.3 farver_2.1.1 rmarkdown_2.25 fs_1.6.3 zlibbioc_1.46.0
[9] vctrs_0.6.5 memoise_2.0.1 RCurl_1.98-1.14 ggtree_3.8.2 askpass_1.2.0 htmltools_0.5.8.1 cellranger_1.1.0 gridGraphics_0.5-1
[17] plyr_1.8.9 cachem_1.0.8 conflicted_1.2.0 igraph_2.0.3 lifecycle_1.0.4 pkgconfig_2.0.3 gson_0.1.0 Matrix_1.6-5
[25] R6_2.5.1 fastmap_1.1.1 GenomeInfoDbData_1.2.10 aplot_0.2.2 digest_0.6.35 colorspace_2.1-0 patchwork_1.2.0 RSpectra_0.16-1
[33] pkgload_1.3.4 RSQLite_2.3.6 labeling_0.4.3 fansi_1.0.6 timechange_0.3.0 httr_1.4.7 polyclip_1.10-6 compiler_4.3.1
[41] bit64_4.0.5 withr_3.0.0 downloader_0.4 BiocParallel_1.34.2 viridis_0.6.5 DBI_1.2.2 ggforce_0.4.2 MASS_7.3-60.0.1
[49] openssl_2.1.1 HDO.db_0.99.1 tools_4.3.1 vipor_0.4.7 scatterpie_0.2.1 ape_5.7-1 beeswarm_0.4.0 clipr_0.8.0
[57] nlme_3.1-164 GOSemSim_2.26.1 shadowtext_0.1.3 grid_4.3.1 reshape2_1.4.4 generics_0.1.3 gtable_0.3.4 tzdb_0.4.0
[65] hms_1.1.3 tidygraph_1.3.1 utf8_1.2.4 XVector_0.40.0 pillar_1.9.0 yulab.utils_0.1.4 babelgene_22.9 splines_4.3.1
[73] tweenr_2.0.2 treeio_1.24.3 lattice_0.22-6 bit_4.0.5 tidyselect_1.2.1 GO.db_3.17.0 locfit_1.5-9.8 Biostrings_2.68.1
[81] knitr_1.45 gridExtra_2.3 xfun_0.43 graphlayouts_1.1.1 statmod_1.5.0 stringi_1.8.3 lazyeval_0.2.2 ggfun_0.1.4
[89] yaml_2.3.8 evaluate_0.23 codetools_0.2-19 ggraph_2.1.0 qvalue_2.32.0 BiocManager_1.30.22 ggplotify_0.1.2 cli_3.6.2
[97] reticulate_1.36.1 munsell_0.5.0 Rcpp_1.0.12 GenomeInfoDb_1.36.4 png_0.1-8 parallel_4.3.1 blob_1.2.4 DOSE_3.26.1
[105] bitops_1.0-7 tidytree_0.4.6 viridisLite_0.4.2 scales_1.3.0 crayon_1.5.2 rlang_1.1.3 cowplot_1.1.3 fastmatch_1.1-4
[113] KEGGREST_1.40.1
Thanks for your thoughtful and detailed comment, Gordon