Hello,
I have a cohort of samples that are ostensibly of the same type, but some express high levels of unique genes, which make these individuals worthy of further investigation. As such, I would like to compare each single sample to the rest, in order to find the interesting ones, then specifically identify which genes differ most from the population.
I know it is not ideal to have no replicates, but I was thinking of fitting a series of linear models, one per sample, and then globally adjusting all of the resulting p-values, to take into account the extra multiple testing. Does this approach sound vaguely sensible? Any feedback or other suggestions would be gratefully received.
Thanks!
Jamie.
library(limma)
## Cohort
nsamples <- 50
ngenes <- 100
## Expression
set.seed(12)
sd <- 0.3*sqrt(4/rchisq(ngenes, df = 4))
set.seed(12)
y <- matrix(rnorm(ngenes*nsamples, sd = sd), ngenes, nsamples)
rownames(y) <- paste0("Gene", 1:ngenes)
colnames(y) <- paste0("Sample", 1: nsamples)
## Increase expression
# Gene 2 higher in Sample 2
# Gene 3 higher in Sample 3
y[2, 2] <- y[2, 2] + 5
y[3, 3] <- y[3, 3] + 5
## lmFit for each sample versus the rest
fit2.list <- lapply(1:ncol(y), function(i){
design <- matrix(c(1, 0), ncol = 2, nrow = ncol(y), byrow = TRUE,
dimnames = list(colnames(y), c("(Intercept)", colnames(y)[i])))
design[i, 2] <- 1
fit <- lmFit(y, design)
fit2 <- contrasts.fit(fit, coefficients = 2)
eBayes(fit2)
})
## Adjust p-values globally
global.p <- lapply(fit2.list, "[[", "p.value")
global.p <- do.call(global.p, what = "cbind")
global.p[] <- p.adjust(global.p, method = "BH")
which(global.p < 0.05, arr.ind = TRUE)
# row col
# Gene2 2 2
# Gene3 3 3
global.p[1:3, 1:3]
# Sample1 Sample2 Sample3
# Gene1 0.980 9.80e-01 9.50e-01
# Gene2 0.996 2.79e-11 9.96e-01
# Gene3 0.996 9.96e-01 1.68e-23
sessionInfo()
# R version 4.4.0 (2024-04-24)
# Platform: aarch64-apple-darwin20
# Running under: macOS Sonoma 14.4.1
# Matrix products: default
# BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
# LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
# locale:
# [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
# time zone: Europe/Berlin
# tzcode source: internal
# attached base packages:
# [1] stats graphics grDevices utils datasets methods base
# other attached packages:
# [1] limma_3.60.0
# loaded via a namespace (and not attached):
# [1] compiler_4.4.0 statmod_1.5.0

Great. Yes, something descriptive is probably good enough, even if the p-values themselves are not so rigorous. Thanks very much Gordon!