I analyse bulk RNA-seq data from 21 samples in seven groups ('Group'), where each of the replicates originate from a different source ("Source"). Six contrasts are evaluated. From the MDS plot I can see a strong influence of the 'Source' variable.
I use the limma-voom
workflow with three different models:
library(limma)
library(edgeR)
[...]
samples <- data.frame(Group=as.factor(rep(paste0("G", 1:7), rep=3)), Source=as.factor(rep(c(3,1,2), each=7)))
levels(samples$Source) <- c("2","3","1")
# Model 1
design <- model.matrix(~0+Group, data=samples)
colnames(design) <- sub("Group", "", colnames(design))
dge.norm.voom <- voom(dge.norm, design, plot=TRUE)
dge.norm.voom.fit <- lmFit(dge.norm.voom)
# Model 2
design <- model.matrix(~0+Group+Source, data=samples)
colnames(design) <- sub("Group", "", colnames(design))
dge.norm.voom <- voom(dge.norm, design, plot=TRUE)
dge.norm.voom.fit <- lmFit(dge.norm.voom)
# Model 3
design <- model.matrix(~0+Group, data=samples)
colnames(design) <- sub("Group", "", colnames(design))
dge.norm.voom <- voom(dge.norm, design, plot=TRUE)
dge.norm.voom.cor <- duplicateCorrelation(dge.norm.voom, design, block=samples$Source)
dge.norm.voom.cor$consensus.correlation
# 0.290429
dge.norm.voom.fit <- lmFit(dge.norm.voom, design, block=samples$Source,
correlation=dge.norm.voom.cor$consensus.correlation)
# Common to all models
contr.matrix <- makeContrasts(C1=G1-G7,C2=G2-G7,C3=G3-G7,C4=G4-G7,C5=G5-G7,C6=G6-G7, levels = colnames(design))
dge.norm.voom.fit <- contrasts.fit(dge.norm.voom.fit, contrasts=contr.matrix)
efit <- eBayes(dge.norm.voom.fit)
efit.dt <- decideTests(efit)
efit.dt.sum <- t(summary(efit.dt))
print(efit.dt.sum)
The three models yield the following numbers of differentially expressed genes:
~0+Group | ~0+Group+Source | ~0+Group (block=Source)
Contrast Down NotSig Up | Down NotSig Up | Down NotSig Up
C1 53 18552 20 | 676 17500 449 | 571 17818 236
C2 239 18322 64 | 1113 16828 684 | 962 17255 408
C3 490 17936 199 | 1626 15846 1153 | 1535 16311 779
C4 60 18539 26 | 450 17794 381 | 326 18093 206
C5 59 18498 68 | 486 17521 618 | 410 17801 414
C6 46 18557 22 | 369 17880 376 | 225 18241 159
Treating "Source" as a "not of interest" factor (Model #2) yields consistently more differentially expressed genes than using it as a blocking variable (Model #3).
Is this expected or may this indicate an error in the analysis?
The environment is as follows:
R version 4.3.0 (2023-04-21 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22621)
locale:
[1] LC_COLLATE=English_Austria.utf8 LC_CTYPE=English_Austria.utf8 LC_MONETARY=English_Austria.utf8 LC_NUMERIC=C LC_TIME=English_Austria.utf8
time zone: Europe/Vienna
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] edgeR_3.42.4 limma_3.56.2
loaded via a namespace (and not attached):
[1] compiler_4.3.0 Rcpp_1.0.10 grid_4.3.0 locfit_1.5-9.8 statmod_1.5.0 lattice_0.21-8
It's perfectly expected. Essentially it's a paired analysis removing the confounding effect of "Source".