I have an experiment where each sample is measured multiple times, and samples are nested within the experimental group. I have noticed that my choice of sample reference level seems to influence the results. Below is a minimal example to show what I mean:
library(edgeR)

counts <- matrix(c(6, 2, 6, 21, 19, 15, 17, 18, 20, 12), nrow = 1)
counts <- rbind(counts, 25 - counts)

setup <- data.frame(
  sample = factor(c("1", "1", "1", "2", "2", "1", "1", "2", "2", "2")),
  group  = factor(c(rep("a", 5), rep("b", 5)))
)

tester <- function() {
  design <- model.matrix(~ group + group:sample, setup)
  y <- DGEList(counts = counts, group = setup$group, genes = c("test", "test2"))
  y <- estimateDisp(y, design = design, tagwise = FALSE, trend.method = "none")
  fit <- glmFit(y, design)
  topTags(glmLRT(fit, coef = 2))
}

tester()
# Coefficient:  groupb
#   genes     logFC   logCPM       LR       PValue          FDR
# 1  test  1.750700 19.03706 16.39638 5.138319e-05 0.0001027664
# 2 test2 -1.164792 18.81775 10.49939 1.194136e-03 0.0011941364

# Set the reference level of sample to "2" instead of "1"
setup$sample <- relevel(setup$sample, "2")
tester()
# Coefficient:  groupb
#   genes      logFC   logCPM        LR    PValue       FDR
# 2 test2  0.7228214 18.81775 1.9869241 0.1586629 0.3173257
# 1  test -0.2612434 19.03706 0.7314366 0.3924179 0.3924179
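For reference, here is a quick sketch of how the design matrix columns change with the sample reference level, using the objects above (the commented output is what I would expect from R's default treatment coding):

# How the interaction columns change with the sample reference level.
setup$sample <- relevel(setup$sample, "1")
colnames(model.matrix(~ group + group:sample, setup))
# "(Intercept)" "groupb" "groupa:sample2" "groupb:sample2"

setup$sample <- relevel(setup$sample, "2")
colnames(model.matrix(~ group + group:sample, setup))
# "(Intercept)" "groupb" "groupa:sample1" "groupb:sample1"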
I have a couple of questions:

1. Is this intended behaviour, and if so, why?
2. Is my model specification incorrect, or can I change it to ensure consistent results?
sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252    LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C                            LC_TIME=English_United Kingdom.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] edgeR_3.20.9 limma_3.34.9

loaded via a namespace (and not attached):
[1] compiler_3.4.3  tools_3.4.3     Rcpp_0.12.16    grid_3.4.3      locfit_1.5-9.1  lattice_0.20-35
Thank you for the comprehensive explanation; that cleared things up quite a bit. What I am really interested in is the log-fold change of the average of B1 and B2 over the average of A1 and A2. Can I design a contrast with my current model specification that would test this? Something like ((Intercept) + groupa:sample2)/2 - (groupb + groupb:sample2)/2 (this formulation seems wrong). Alternatively, I could fit a model of the form ~ 0 + group:sample and then make a contrast of (b1 + b2)/2 - (a1 + a2)/2; that should give me the difference I am after, correct?

Looking at the first design matrix:
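A sketch of that matrix (assuming the original reference levels, sample "1" and group "a", and R's default treatment coding), showing one representative row for each group:sample combination:

# One representative row of the design matrix per group:sample combination.
setup$sample <- relevel(setup$sample, "1")
design <- model.matrix(~ group + group:sample, setup)
design[c(1, 4, 6, 8), ]
#   (Intercept) groupb groupa:sample2 groupb:sample2
# 1           1      0              0              0   (A1)
# 4           1      0              1              0   (A2)
# 6           1      1              0              0   (B1)
# 8           1      1              0              1   (B2)

So the fitted value for A1 is (Intercept), for A2 it is (Intercept) + groupa:sample2, for B1 it is (Intercept) + groupb, and for B2 it is (Intercept) + groupb + groupb:sample2.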
So to create the null hypothesis equivalent to

(A1 + A2)/2 = (B1 + B2)/2

you set

(Intercept) + groupa:sample2/2 = (Intercept) + groupb + groupb:sample2/2

which can be simplified to

groupb + groupb:sample2/2 - groupa:sample2/2 = 0

the left-hand side of which is the expression to use in makeContrasts().
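In practice that could look something like the sketch below; the make.names() step is an assumption on my part, needed only because the default column names contain ":" and parentheses, which makeContrasts() cannot parse as variable names:

# Sketch: test (B1 + B2)/2 - (A1 + A2)/2 on the nested model.
# Renaming the design columns only changes the labels, not the fit.
colnames(design) <- make.names(colnames(design))
colnames(design)
# "X.Intercept." "groupb" "groupa.sample2" "groupb.sample2"

y <- DGEList(counts = counts, group = setup$group, genes = c("test", "test2"))
y <- estimateDisp(y, design = design, tagwise = FALSE, trend.method = "none")
fit <- glmFit(y, design)

con <- makeContrasts(groupb + groupb.sample2/2 - groupa.sample2/2,
                     levels = design)
topTags(glmLRT(fit, contrast = con))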
If you think that looks unnecessarily complicated, you'd be right; a far simpler approach would be to combine group and sample into a single factor and use a no-intercept model, as you've proposed in your second option.
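A sketch of that simpler route (the combined factor name gs is just a placeholder of my choosing):

# Combine group and sample into one factor and drop the intercept,
# so each coefficient is the mean of one group:sample combination.
setup$gs <- factor(paste0(setup$group, setup$sample))  # levels: a1 a2 b1 b2
design2 <- model.matrix(~ 0 + gs, setup)
colnames(design2) <- levels(setup$gs)

y2 <- DGEList(counts = counts, group = setup$group, genes = c("test", "test2"))
y2 <- estimateDisp(y2, design = design2, tagwise = FALSE, trend.method = "none")
fit2 <- glmFit(y2, design2)

con2 <- makeContrasts((b1 + b2)/2 - (a1 + a2)/2, levels = design2)
topTags(glmLRT(fit2, contrast = con2))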
That is excellent. I find the construction of contrasts in R confusing, and not all packages have the same flexibility as edgeR does, so this will definitely help me in the future.