Question re "Group-specific condition effects, individuals nested within groups"
1
0
Entering edit mode
fnigsch • 0
@fnigsch-18127
Last seen 2.9 years ago

Hi,

I am working with a dataset that has precisely the nested structure as exemplified in the section "Group-specific condition effects, individuals nested within groups" of the DESeq2 RNAseq tutorial: https://www.bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#group-specific-condition-effects-individuals-nested-within-groups

The structure is the same, but I do have more groups, a variable number of individuals per group, and two conditions per invidividual. (I therefore removed all the columns from the design matrix with all zeros for coefficients that do not exist.)

Working through my data, I fit a model exactly as the one listed in the tutorial:

model.matrix(~ grp + grp:ind.n + grp:cnd, coldata)

I got the results for the contrasts that I am interested in via results(). When inspecting the results, and subsequently playing with some toy data, I got some doubts about the correctness of what I did.

My toy example:

Two groups (A, B), three individuals in each (1, 2, 3), two conditions per individual (x, y), two replicates per (group, individual, condition), values of condition y are ~2 higher than condition x:

df <- data.frame(     grp=factor(rep(c("A", "B"), each=12)),     cond=factor(rep(rep(c("x", "y"), each=2), 6)),     ind=factor(rep(1:3, each=4)),     value=rnorm(24, 10) +         c(rnorm(12, -5), rnorm(12, 5)) +         ifelse(df$cond == "x", 0, rnorm(24, 2, 0.5)) ) I then fit two linear models. Here the first one, plus the coefficients: lm(value ~ grp + grp:ind + grp:cond, df) Call: lm(formula = value ~ grp + grp:ind + grp:cond, data = df) Coefficients: (Intercept) grpB grpA:ind2 grpB:ind2 grpA:ind3 grpB:ind3 grpA:condy grpB:condy 5.6597 9.6079 -0.1103 1.4499 0.3759 0.4865 2.1481 2.3591  And here the second model, excluding the grp:ind term: lm(value ~ grp + grp:cond, df) Call: lm(formula = value ~ grp + grp:cond, data = df) Coefficients: (Intercept) grpB grpA:condy grpB:condy 5.748 10.165 2.148 2.359  My questions to the above: 1) What is the meaning of the intercept in either of these models? 2) The condition-specific effects per group (grpA:condy, grpB:condy) are the same for both models. If it was taken into account that there are paired samples (for each individual) then this should not be the case. What am I missing? Any help greatly appreciated! deseq2 linear model • 428 views ADD COMMENT 0 Entering edit mode @james-w-macdonald-5106 Last seen 9 hours ago United States You can infer what the coefficients are, based on a bit of algebra. Note that the 0 and 1 entries in a model matrix indicate if a coefficient is estimated for a given sample or not. So for any set of rows that has only one 1, the coefficient for those rows is the mean of the group defined by the rows. Easier shown than described; > mod1 <- model.matrix(~grp + grp:cond, df) > mod2 <- model.matrix(~grp + grp:ind + grp:cond, df) > mod1 (Intercept) grpB grpA:condy grpB:condy 1 1 0 0 0 2 1 0 0 0 3 1 0 1 0 4 1 0 1 0 5 1 0 0 0 6 1 0 0 0 7 1 0 1 0 8 1 0 1 0 9 1 0 0 0 10 1 0 0 0 11 1 0 1 0 12 1 0 1 0 13 1 1 0 0 14 1 1 0 0 15 1 1 0 1 16 1 1 0 1 17 1 1 0 0 18 1 1 0 0 19 1 1 0 1 20 1 1 0 1 21 1 1 0 0 22 1 1 0 0 23 1 1 0 1 24 1 1 0 1 So if we extract the rows with just one 1 (these are the intercept-only rows), we get > df[rowSums(mod1) == 1,] grp cond ind value 1 A x 1 5.422499 2 A x 1 3.027077 5 A x 2 3.360429 6 A x 2 4.166690 9 A x 3 4.230351 10 A x 3 5.021791 And we can then infer that the intercept is the mean of the subjects from Group A, condition x, where we are computing the mean over all individuals. Figuring out the rest is simple algebra. For example, row 13 is an individual from Group B, condition x. If we substitute Grp_B_cond_x as shorthand, we have Grp_B_cond_x = Grp_A_cond_x + X Because that row has a coefficient for the intercept, and a coefficient for grpB. Solving for X gives us X = Grp_B_cond_x - Grp_A_cond_x So the 'grpB' coefficient is the difference between the Group B, condition x samples and the Group A, condition x samples. You can figure out the other two similarly (they are the difference between the Group A, condition y and Group A, condition x, and the same difference for the Group B samples). You can figure out the coefficients for the larger model similarly (I leave that to the reader as an exercise), but what you will find is that for the larger model the grpA:condy and grpB:condy coefficients are computing the same thing as in the smaller model, which is why you get the same coefficient. But that's only half of the story! (Continued below) ADD COMMENT 0 Entering edit mode The denominator of the contrast(s) you will compute is based on the within-group variabiliity (which is supposed to tell you something about the within-population variability), and is used to determine if the t-statistic you compute is unexpectedly large, given the null distribution. If you use the smaller model that ignores the pairing, when you compute the within-group variability you ignore the fact that some of the subjects are the same person, and are expected to be correlated (e.g., have lower variance within subject as compared to between subject). In one sense that's cheating because you may have smaller variance estimates, and more degrees of freedom (you lose one degree of freedom for each coefficient estimated). On the other hand, if there are subject-specific differences that you might want to account for, you may have larger within-group variances than what you would get if you accounted for the pairing. As an example, let's add some sample-specific changes in the expression values > df2 <- df > df2$value <- df2\$value + rep(c(3,-1,2,-3,5,1), each = 4)

And now if we fit the smaller model, we get

> summary(lm(value~grp+ grp:cond, df))

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   4.2048     0.6972   6.031 6.76e-06 ***
grpB         10.7152     0.9859  10.868 7.67e-10 ***
grpA:condy    2.3083     0.9859   2.341   0.0297 *
grpB:condy    1.3924     0.9859   1.412   0.1732

> summary(lm(value~grp+ grp:cond, df2))

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)    5.538      1.268   4.368 0.000298 ***
grpB          10.382      1.793   5.790 1.15e-05 ***
grpA:condy     2.308      1.793   1.287 0.212701
grpB:condy     1.392      1.793   0.776 0.446552
---


Where the extra variability really messes things up if we ignore the pairing. But if we include the pairing, there's no problem

> summary(lm(value~grp+grp:ind + grp:cond, df))

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.4706     1.0080   3.443  0.00334 **
grpB         11.1166     1.4255   7.799 7.71e-07 ***
grpA:ind2     1.8703     1.2345   1.515  0.14928
grpB:ind2     0.1588     1.2345   0.129  0.89924
grpA:ind3     0.3322     1.2345   0.269  0.79128
grpB:ind3     0.8393     1.2345   0.680  0.50630
grpA:condy    2.3083     1.0080   2.290  0.03594 *
grpB:condy    1.3924     1.0080   1.381  0.18615
---

> summary(lm(value~grp+grp:ind + grp:cond, df2))

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   6.4706     1.0080   6.420 8.48e-06 ***
grpB          5.1166     1.4255   3.589  0.00245 **
grpA:ind2    -2.1297     1.2345  -1.725  0.10375
grpB:ind2     8.1588     1.2345   6.609 6.01e-06 ***
grpA:ind3    -0.6678     1.2345  -0.541  0.59601
grpB:ind3     4.8393     1.2345   3.920  0.00122 **
grpA:condy    2.3083     1.0080   2.290  0.03594 *
grpB:condy    1.3924     1.0080   1.381  0.18615
---