Question re "Group-specific condition effects, individuals nested within groups"
1
0
Entering edit mode
fnigsch • 0
@fnigsch-18127
Last seen 5.5 years ago

Hi,

I am working with a dataset that has precisely the nested structure as exemplified in the section "Group-specific condition effects, individuals nested within groups" of the DESeq2 RNAseq tutorial: https://www.bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#group-specific-condition-effects-individuals-nested-within-groups

The structure is the same, but I do have more groups, a variable number of individuals per group, and two conditions per invidividual. (I therefore removed all the columns from the design matrix with all zeros for coefficients that do not exist.)

Working through my data, I fit a model exactly as the one listed in the tutorial:

model.matrix(~ grp + grp:ind.n + grp:cnd, coldata)

I got the results for the contrasts that I am interested in via results(). When inspecting the results, and subsequently playing with some toy data, I got some doubts about the correctness of what I did.

My toy example:

 

Two groups (A, B), three individuals in each (1, 2, 3), two conditions per individual (x, y), two replicates per (group, individual, condition), values of condition y are ~2 higher than condition x:

df <- data.frame(
    grp=factor(rep(c("A", "B"), each=12)),
    cond=factor(rep(rep(c("x", "y"), each=2), 6)),
    ind=factor(rep(1:3, each=4)),
    value=rnorm(24, 10) +
        c(rnorm(12, -5), rnorm(12, 5)) +
        ifelse(df$cond == "x", 0, rnorm(24, 2, 0.5))
)


I then fit two linear models. Here the first one, plus the coefficients:

lm(value ~ grp + grp:ind + grp:cond, df)

Call:
lm(formula = value ~ grp + grp:ind + grp:cond, data = df)

Coefficients:
(Intercept)         grpB    grpA:ind2    grpB:ind2    grpA:ind3    grpB:ind3   grpA:condy   grpB:condy  
     5.6597       9.6079      -0.1103       1.4499       0.3759       0.4865       2.1481       2.3591 

And here the second model, excluding the grp:ind term:

lm(value ~ grp + grp:cond, df)

Call:
lm(formula = value ~ grp + grp:cond, data = df)

Coefficients:
(Intercept)         grpB   grpA:condy   grpB:condy  
      5.748       10.165        2.148        2.359 

 

My questions to the above:

1) What is the meaning of the intercept in either of these models?

2) The condition-specific effects per group (grpA:condy, grpB:condy) are the same for both models. If it was taken into account that there are paired samples (for each individual) then this should not be the case. What am I missing?

Any help greatly appreciated!

deseq2 linear model • 962 views
ADD COMMENT
0
Entering edit mode
@james-w-macdonald-5106
Last seen 44 minutes ago
United States

You can infer what the coefficients are, based on a bit of algebra. Note that the 0 and 1 entries in a model matrix indicate if a coefficient is estimated for a given sample or not. So for any set of rows that has only one 1, the coefficient for those rows is the mean of the group defined by the rows. Easier shown than described;

> mod1 <- model.matrix(~grp + grp:cond, df)
> mod2 <- model.matrix(~grp + grp:ind + grp:cond, df)
> mod1
   (Intercept) grpB grpA:condy grpB:condy
1            1    0          0          0
2            1    0          0          0
3            1    0          1          0
4            1    0          1          0
5            1    0          0          0
6            1    0          0          0
7            1    0          1          0
8            1    0          1          0
9            1    0          0          0
10           1    0          0          0
11           1    0          1          0
12           1    0          1          0
13           1    1          0          0
14           1    1          0          0
15           1    1          0          1
16           1    1          0          1
17           1    1          0          0
18           1    1          0          0
19           1    1          0          1
20           1    1          0          1
21           1    1          0          0
22           1    1          0          0
23           1    1          0          1
24           1    1          0          1

So if we extract the rows with just one 1 (these are the intercept-only rows), we get

> df[rowSums(mod1) == 1,]
   grp cond ind    value
1    A    x   1 5.422499
2    A    x   1 3.027077
5    A    x   2 3.360429
6    A    x   2 4.166690
9    A    x   3 4.230351
10   A    x   3 5.021791

And we can then infer that the intercept is the mean of the subjects from Group A, condition x, where we are computing the mean over all individuals. Figuring out the rest is simple algebra. For example, row 13 is an individual from Group B, condition x. If we substitute Grp_B_cond_x as shorthand, we have

Grp_B_cond_x = Grp_A_cond_x + X

Because that row has a coefficient for the intercept, and a coefficient for grpB. Solving for X gives us

X = Grp_B_cond_x - Grp_A_cond_x

So the 'grpB' coefficient is the difference between the Group B, condition x samples and the Group A, condition x samples. You can figure out the other two similarly (they are the difference between the Group A, condition y and Group A, condition x, and the same difference for the Group B samples).

You can figure out the coefficients for the larger model similarly (I leave that to the reader as an exercise), but what you will find is that for the larger model the grpA:condy and grpB:condy coefficients are computing the same thing as in the smaller model, which is why you get the same coefficient. But that's only half of the story! (Continued below)

ADD COMMENT
0
Entering edit mode

The denominator of the contrast(s) you will compute is based on the within-group variabiliity (which is supposed to tell you something about the within-population variability), and is used to determine if the t-statistic you compute is unexpectedly large, given the null distribution. If you use the smaller model that ignores the pairing, when you compute the within-group variability you ignore the fact that some of the subjects are the same person, and are expected to be correlated (e.g., have lower variance within subject as compared to between subject). In one sense that's cheating because you may have smaller variance estimates, and more degrees of freedom (you lose one degree of freedom for each coefficient estimated). On the other hand, if there are subject-specific differences that you might want to account for, you may have larger within-group variances than what you would get if you accounted for the pairing. As an example, let's add some sample-specific changes in the expression values

> df2 <- df
> df2$value <- df2$value + rep(c(3,-1,2,-3,5,1), each = 4)

And now if we fit the smaller model, we get

> summary(lm(value~grp+ grp:cond, df))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   4.2048     0.6972   6.031 6.76e-06 ***
grpB         10.7152     0.9859  10.868 7.67e-10 ***
grpA:condy    2.3083     0.9859   2.341   0.0297 *  
grpB:condy    1.3924     0.9859   1.412   0.1732    

> summary(lm(value~grp+ grp:cond, df2))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)    5.538      1.268   4.368 0.000298 ***
grpB          10.382      1.793   5.790 1.15e-05 ***
grpA:condy     2.308      1.793   1.287 0.212701    
grpB:condy     1.392      1.793   0.776 0.446552    
---

Where the extra variability really messes things up if we ignore the pairing. But if we include the pairing, there's no problem

> summary(lm(value~grp+grp:ind + grp:cond, df))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   3.4706     1.0080   3.443  0.00334 **
grpB         11.1166     1.4255   7.799 7.71e-07 ***
grpA:ind2     1.8703     1.2345   1.515  0.14928    
grpB:ind2     0.1588     1.2345   0.129  0.89924    
grpA:ind3     0.3322     1.2345   0.269  0.79128    
grpB:ind3     0.8393     1.2345   0.680  0.50630    
grpA:condy    2.3083     1.0080   2.290  0.03594 *  
grpB:condy    1.3924     1.0080   1.381  0.18615    
---

> summary(lm(value~grp+grp:ind + grp:cond, df2))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   6.4706     1.0080   6.420 8.48e-06 ***
grpB          5.1166     1.4255   3.589  0.00245 **
grpA:ind2    -2.1297     1.2345  -1.725  0.10375    
grpB:ind2     8.1588     1.2345   6.609 6.01e-06 ***
grpA:ind3    -0.6678     1.2345  -0.541  0.59601    
grpB:ind3     4.8393     1.2345   3.920  0.00122 **
grpA:condy    2.3083     1.0080   2.290  0.03594 *  
grpB:condy    1.3924     1.0080   1.381  0.18615    
---
ADD REPLY

Login before adding your answer.

Traffic: 752 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6