Question: edgeR: Choice of reference group in nested model
0
gravatar for ri.lars
18 months ago by
ri.lars40
ri.lars40 wrote:

I have an experiment where each sample is measured multiple times and samples are nested within the experimental group. I have noticed that my choice of sample reference level seems to influence the results. Below is a minimal example to show what I mean:

library(edgeR)
counts <- matrix(c(6,2,6,21,19,15,17,18,20,12), nrow = 1)
counts <- rbind(counts, 25 - counts)

setup <- data.frame(sample = factor(c("1","1","1","2","2","1","1","2","2","2")),
                    group = factor(c(rep("a", 5), rep("b", 5))))

tester <- function(){
  design <- model.matrix(~ group + group:sample, setup)
  y <- DGEList(counts = counts, group = setup$group, genes = c("test", "test2"))
  y <- estimateDisp(y, design=design, tagwise = FALSE, trend.method = "none")
  fit <- glmFit(y, design)
  topTags(glmLRT(fit, coef = 2))
}

tester()
# Coefficient:  groupb 
#   genes     logFC   logCPM       LR       PValue          FDR
# 1  test  1.750700 19.03706 16.39638 5.138319e-05 0.0001027664
# 2 test2 -1.164792 18.81775 10.49939 1.194136e-03 0.0011941364

# Set reference group to "2" instead of "1"
setup$sample <- relevel(setup$sample, "2")
tester()
# Coefficient:  groupb 
#   genes      logFC   logCPM        LR    PValue       FDR
# 2 test2  0.7228214 18.81775 1.9869241 0.1586629 0.3173257
# 1  test -0.2612434 19.03706 0.7314366 0.3924179 0.3924179

I have a couple of questions: Is this intended behaviour, and if so, why?

Is my model specification incorrect or can I change it to ensure consistent results?

sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252    LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C                            LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] edgeR_3.20.9 limma_3.34.9

loaded via a namespace (and not attached):
[1] compiler_3.4.3  tools_3.4.3     Rcpp_0.12.16    grid_3.4.3      locfit_1.5-9.1  lattice_0.20-35
ADD COMMENTlink modified 18 months ago by Aaron Lun25k • written 18 months ago by ri.lars40
Answer: edgeR: Choice of reference group in nested model
3
gravatar for Aaron Lun
18 months ago by
Aaron Lun25k
Cambridge, United Kingdom
Aaron Lun25k wrote:

Because the second coefficient represents different things in the two design matrices.

Let's look at the design matrix when 1 is the reference level for sample. Column names are:

(Intercept) groupb groupa:sample2 groupb:sample2

If you examine the entries of design, you will find that the intercept represents the average expression of the A-1 combination. The groupb coefficient represents the log-fold change in B-1 over A-1. groupa:sample2 represents the log-fold change in A-2 over A-1, while groupb:sample2 is the log-fold change in B-2 over B-1.

Now, let's have a look at the design matrix after the call to relevel(), which has column names:

(Intercept) groupb groupa:sample1 groupb:sample1

Now, the intercept represents the average expression of the A-2 combination, the new "overall baseline" across the two factors. The groupb coefficient represents the log-fold change in B-2 over A-2, which is clearly different from B-1 vs A-1. groupa:sample1 represents the log-fold change in A-1 over A-2, while groupb:sample1 represents the log-fold change in B-1 over B-2. 

In short, don't be fooled by the fact that the columns have the same names before and after relevel(). You need to look at the design matrix to figure out what the coefficients actually represent.

ADD COMMENTlink modified 18 months ago • written 18 months ago by Aaron Lun25k

Thank you for the comprehensive explanation, that cleared things up quite a bit. What I am really interested in would be the log-fold change in the average of B1 and B2 over the average of A1 and A2. Can I design a contrast with my current model specification that would test this? Something like ((Intercept) + groupa:sample2)/2 - (groupb + groupb:sample2)/2. (this formulation seems wrong).

Alternatively I could fit a model of the form ~ 0 + group:sample, and then make a contrast of (b1 + b2)/2 - (a1 + a2)/2, that should give me the difference I am after, correct?

ADD REPLYlink modified 18 months ago • written 18 months ago by ri.lars40
1

Looking at the first design matrix:

A1: Intercept
B1: Intercept + groupb
A2: Intercept + groupa:sample2
B2: Intercept + groupb + groupb:sample2

So to create the null hypothesis equivalent to (A1 + A2)/2 = (B1 + B2)/2, you do:

(Intercept + Intercept + groupa:sample2)/2
= (Intercept + groupb + Intercept + groupb + groupb:sample2)/2

... which can be simplified to:

(groupa:sample2 - groupb:sample2)/2 - groupb = 0

... the left-hand-side of which is the expression to use in makeContrasts().

If you think that looks unnecessarily complicated, you'd be right; a far simpler approach would be to combine group and sample into a single factor, and use a no-intercept model as you've propose in your second option.

ADD REPLYlink modified 18 months ago • written 18 months ago by Aaron Lun25k

That is excellent, I find the construction of contrasts in R to be confusing and not all packages has the same flexibility as edgeR does so this will definitely help me in the future.

ADD REPLYlink written 18 months ago by ri.lars40
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 132 users visited in the last hour