EdgeR design matrix with 3 factors
1
1
Entering edit mode
Denniswu ▴ 20
@denniswu-14552
Last seen 4.7 years ago

Hello, I'm attempting to run a glm analysis with edgeR on 32 biologically independent samples, with 3 factors. I have sex, age, and genotype. I'd like to create a full design matrix, but I'm having some trouble doing this, and understanding all the contrasts possible.

> sex = rep(c("m","f"),each=2,times=8)
> time = rep(c("p0","p2","p6","p12"),each=8)
> geno = factor(rep(c("WT","KO"),times=16))
> geno = relevel(geno,ref="WT")
>
> full_design = model.matrix(~0+sex*geno*time)
> colnames(full_design)
[1] "sexf"                "sexm"                "genoKO"
[4] "timep12"             "timep2"              "timep6"
[7] "sexm:genoKO"         "sexm:timep12"        "sexm:timep2"
[10] "sexm:timep6"         "genoKO:timep12"      "genoKO:timep2"
[13] "genoKO:timep6"       "sexm:genoKO:timep12" "sexm:genoKO:timep2"
[16] "sexm:genoKO:timep6" 

From what I understand of the EdgeR manual, the first 6 fields of the design matrix are the coefficients of each factor, against the baseline of each other factor (so "sexf" is "f" at "p0", with a "WT" genotype, "timeP12" is "f" at "p12", with genotype "WT", and so on). Is this correct? And is this design matrix the best way to measure the differences between my samples? Thank you very much for your help.

EDIT: Looking at the design matrix, it seems that my previous understanding was mistaken. The first column, "sexf", is a length 32 vector, with 1s at all rows where the sex is "f" - so, equivalent to rep(c(0,1),each=2,times=8). In this case, EdgeR does have information on all samples that have sex "f", but I'm not sure how, in this case, one would find the difference between "sexf" and "sexm", at p0, since "sexm:timep0" doesn't exist in the "full_design" matrix. Am I correct in my understanding of the design matrix - labeling rows that apply to the column name with 1s? And if so, how does one make all possible comparisons with the "full_design" matrix? It seems like it isn't full at all.

edger glm rnaseq • 7.0k views
2
Entering edit mode
Aaron Lun ★ 27k
@alun
Last seen 12 hours ago
The city by the bay

You can either learn from history, by reading Section 3.3 of the edgeR user's guide. Or you can learn from experience, by trying to interpret all the terms in your three-way interaction model. I've done it, and it's not pretty.

Long story short - paste all the factors together and use that as a single factor (without an intercept) in model.matrix. Each coefficient in the design matrix represents a single combination of factors, and all the contrasts can be directly interpreted as comparisons between (linear sums of) groups/combinations.

0
Entering edit mode

Thank you for your help! Would this design matrix, listing out each category as independent samples, provide the most information to EdgeR? I had previously thought that part of the reason to make design matrices was to allow EdgeR to control for effects of each of the factors alone.

Actually looking at full_design reveals that my assumption was incorrect: the first field of the design matrix, "sexf", is not "f" at "p0", with a "WT" genotype, it's all the female animals at each time point with all genotypes, so EdgeR presumably does control for individual levels of factors. However, given this, I believe the full_design matrix is incomplete. It would be impossible to look at sex/genotype effects at P0, for example. Should I just abandon the information that I could provide EdgeR by listing out shared characteristics between the groups?

Section 3.3.2 in the EdgeR User's Guide is confusing in this regard: it states that the second coefficient is the "baseline drug vs placebo comparison". Is this drug vs placebo at p0, or at all time points? I believe their design matrix, for the second coefficient would label all drugged samples as "1", and all non-drugged samples as "0". How would EdgeR know that the second coefficient is the comparison at hour 0, if it is that? And if "TreatDrug" is the drug effect at all time points, how would one find the drug effect at specifically hour 0? I thought I understood all this for a little while yesterday, but it seems that I was still mistaken.

Thank you again for the help, and for bearing with me.

1
Entering edit mode

The following design matrices are the same with respect to the fitted GLM:

group <- paste0(sex, geno, time, sep=".")
design1 <- model.matrix(~0 + group)
design2 <- model.matrix(~sex*geno*time)


There is no mathematical or statistical advantage to using one or the other, and you can always define a contrast for design2 that does the same thing as an equivalent contrast in design1 (and vice versa).

However, there is the minor fact that each of the terms in design2 require at least 5-10 minutes of inspection before you can figure out what they actually mean from a scientific/biological perspective. Do not be fooled, terms named "genoKO", "timep12", etc. are misleading and do not represent average effects of genotype or time, due to the presence of interaction terms in your model. (Even very experienced users can be momentarily confused, see the chain of comments following A: Different results in edgeR using simple vs GLM.) Use design1 and save yourself a major headache.

1
Entering edit mode

So even though design1 and design2 are not the same model matrix, the results I get from EdgeR will be the same? What do the columns in the model matrix actually mean in that case? In section 3.3.4 in the EdgeR User's Guide, the model matrix model.matrix(~Treat * Time, data=targets) is composed of:

   (Intercept) TreatDrug Time1h Time2h TreatDrug:Time1h TreatDrug:Time2h
1            1         0      0      0                0                0
2            1         0      0      0                0                0
3            1         0      1      0                0                0
4            1         0      1      0                0                0
5            1         0      0      1                0                0
6            1         0      0      1                0                0
7            1         1      0      0                0                0
8            1         1      0      0                0                0
9            1         1      1      0                1                0
10           1         1      1      0                1                0
11           1         1      0      1                0                1
12           1         1      0      1                0                1
attr(,"assign")
[1] 0 1 2 2 3 3
attr(,"contrasts")
attr(,"contrasts")$Treat [1] "contr.treatment" attr(,"contrasts")$Time
[1] "contr.treatment"

The second coefficient, "TreatDrug", is described as "the baseline drug vs placebo comparison". From what I understand of your comment, "TreatDrug" describes the difference between drug and placebo at time 0h, not the difference between drug and placebo at all times. However, I don't see how EdgeR is able to tell that "TreatDrug" is the difference only at time 0h. The "TreatDrug" column is c(0,0,0,0,0,0,1,1,1,1,1,1), with no indication of time point (indeed, a TreatDrug:Time0h column would make the matrix non-full rank). How does EdgeR make comparisons that appear impossible to make using the given design matrix?

Also, for what it's worth, I do get different results if I use the single factor method versus the full matrix. I'm comparing WT vs KO in this case. In the group model, I describe a contrast of rep(c(-1,1),times=8), and in the "full" model, I use a contrast of c(-1,-1,1,-1,-1,-1,1,-1,-1,-1,1,1,1,1,1,1). The first two coefficients in the "full" model, in particular, seem very unlike the other coefficients. A comparison of coefficient 1 and coefficient 16 results in fold-changes of 20-30 across the board. So I don't believe that "sexf" are "f" at time "p0" with genotype "WT". It seems to be a combination of all females in my study.

EDIT: After looking at your link, I think I'm starting to understand. I still don't really understand how EdgeR knows to use linear combinations of the other coefficients, or the math of the rationale in the linked answer. I'll try some more testing with the "full" design, comparing to the group design.

EDIT 2: After playing around with the data some more, I still am not seeing how design1 and design2 are supposed to give the same results. All the results of glmQLFTest different between glmQLFit's based off the two designs are different from each other. Comparisons of "genotypeKO" (which, according to what I'm to understand, is genotype KO at p0, for sex f) and "timep2" (which is genotype WT at p2 for sex f), compared against the pasted design, comparing "groupp0_f_KO" and "groupp2_f_WT".

It is also different for even the most seemingly explicit comparisons in the "full" design matrix. The results I get for a comparison between "sexm:genotypeKO:timep12" and "sexm:genotypeKO:timep2" are much different from a comparison between "groupp12_m_KO" and "groupp2_m_KO". It just isn't adding up.

EDIT 3: I found one case in which the results of design1 and design2 match up, which does verify the GLM between the two models is the same. It was using coef=3 for design2, and a contrast vector with index1 being 1, index2 being 2 (1*groupp0_f_KO, -1*groupp0_f_WT). However, similar comparisons for design2 and design1 still don't match up. I'll play even more with the data, but at this point I'm getting convinced I should just drop this and say "full" design matrices just aren't worth it.

1
Entering edit mode

I assure you, these matrices are equivalent, and you can always define a contrast in one that is equivalent to a contrast in the other. If you're getting different results, you aren't doing it correctly. Look:

# Stripped-down example from the user's guide:
set.seed(1000)
counts <- matrix(rpois(8000, 100), ncol=8)
Treat <- rep(c("Placebo", "Drug"), each=4)
Time <- rep(rep(c("0h", "2h"), each=2), 2)

# Using a nested interaction model.
design1 <- model.matrix(~Treat*Time)
y1 <- DGEList(counts)
y1 <- estimateDisp(y1, design1)
fit1 <- glmFit(y1, design1)
res1 <- glmLRT(fit1, coef="TreatPlacebo")
topTags(res1)

# Using a cell-means model.
groups <- paste0(Treat, ".", Time)
design2 <- model.matrix(~0 + groups)
y2 <- DGEList(counts)
y2 <- estimateDisp(y2, design2)
fit2 <- glmFit(y2, design2)
con <- makeContrasts(groupsPlacebo.0h - groupsDrug.0h, levels=design2)
res2 <- glmLRT(fit2, contrast=con)
topTags(res2)

You should find that these results are exactly the same. Obviously, though, it is much easier to tell what is happening with the cell-means model compared to the nested interaction model. The difference in the ease of interpretation is compounded when you have three-way interactions, as in your actual experiment.

You haven't shown any of the code you were using to set up the contrasts (read the posting guide), but I hope you're not literally using the same contrast matrix for the two design matrices. Clearly, this will not give the same results, because the meaning of the coefficients is totally different between the two designs. In fact, for three-way interactions, I often have to solve a linear system to identify a contrast in one design that is equivalent to a contrast in the other design; such an equivalent definitely exists but is not easy to determine from intuition alone when the design gets large.

To address some of your other questions; edgeR doesn't "know" anything. It simply performs whatever contrast it's been told to do. The onus is on you, the user, to understand which terms in the design matrix correspond to your scientific questions of interest. Hence the preference towards a cell-means model, which is easier to understand.

0
Entering edit mode

Hello, I am also confused with the example of edgeR Drug vs Placebo that you mention. I was wondering if you find a way to understand the interactions of edgeR, I am losing my mind on this.