Search
Question: DESeq2 design confusion with three factors
1
21 months ago by

Hello,

Any help would be much appreciated. I'm having trouble understanding how to compare all facets of my data.

Context:

I am analysing samples from two groups of patients ("A" or "B"). All patients are treated identically but only some respond positively to treatment ("P"). Samples are taken "before" and "after" treatment. The goal is to determine which genes are differentially expressed before compared to after, and to assess whether either the group or the response or both have any effect on this change in gene expression after treatment. I have looked around these question pages for a few days now but I cannot find an example like this (to my eyes).

In essence, the following data:

L = 6 # no of patients, 18 in the real data set
df <- data.frame("sample" = rep(paste("sample_", 1:L, sep=""),2),
"time" = rep(c("before","after"),1,each=L),
"group" = rep(c("A", "B")[round(runif(L,1,2))],2),
"response" = rep(c("P", "N")[round(runif(L,1,2))],2)
)
> df
sample   time group response
1  sample_1 before     B        P
2  sample_2 before     B        N
3  sample_3 before     A        N
4  sample_4 before     B        P
5  sample_5 before     B        P
6  sample_6 before     A        P
7  sample_1  after     B        P
8  sample_2  after     B        N
9  sample_3  after     A        N
10 sample_4  after     B        P
11 sample_5  after     B        P
12 sample_6  after     A        P

The base factor levels are "before", "A" and "N" (in df[2:4]) after re-levelling.

Previously I have combined the factors into various different combinations and performed contrasts between the groups for all manner of combinations.

df$c1 <- factor(paste(df$time, df$group, sep=".")) df$c2 <- factor(paste(df$time, df$response, sep="."))
df$c3 <- factor(paste(df$time, df$group, df$response, sep="."))


However, I may have lost power with this approach as some of the subsets only contain a few samples (there are only 18 patients shared between the two time-points). I have resorted to including interaction terms in the design but find it hard to interpret the results with three factors. Am I using the correct design?

Problem:

The current DESeq design:

design(dds) <- formula (~ response + group + time + group:time + response:time)

I am interested in several questions but face confusion with how they are answered. Are any of these approaches correct? If not, how should I formulate them?

1) Which genes are DE after treatment?

results(dds, name="time_after_vs_before")

Given the other base factors, is this actually asking "Which genes are differentially expressed after treatment in response N, group A samples?"?

2) Which genes are DE after treatment in P vs N samples, ignoring groupings?

results(dds, contrast=list(c("time_after_vs_before","responseP.timeafter")))

Or, does this refer to group A only?

3) Which genes are DE after treatment in response P vs N, in group B vs A?

results(dds, contrast = list(c("time_after_vs_before", "responseP.timeafter", "groupB.timeafter")))

Will this result in DE genes between "before" and "after" treatment, in Positive vs Negative patients, in group B against group A? i.e, The main treatment effect + additional effect of positive responding patients + additional effect of group B?

In addition to the treatment only effect, is this the DE result of only the positive response effect?

results(dds, name="responseP.timeafter")

Likewise,  does this show only the additional effect of group B on the treatment?

results(dds, name="groupB.timeafter")

4) Would this design answer any questions about the additional effect of group B upon the contrast of P vs N when not comparing "after" to "before" or do I need to add to the design with "+ response:group" to achieve this?

results(dds, contrast = list(c("response_P_vs_N", "responseP.groupB")))

Apologies if this is confusing.

Thanks for the help,

Vincent


modified 21 months ago by Gavin Kelly560 • written 21 months ago by vincent.knightschrijver10

so what is the final solution you got?

I am dealing with 2 treatment types, 2 response type, 2 tissue type, and 4-time points. It would be very helpful for me.

2
21 months ago by
Michael Love20k
United States
Michael Love20k wrote:

Hi,

I took a look at your design and your stated goal "to determine which genes are differentially expressed before compared to after, and to assess whether either the group or the response or both have any effect on this change in gene expression after treatment", and I'll just give advice from there (often this saves time than going down other paths).

I would recommend you follow the example in the vignette that describes group-specific condition effects, when the individuals are nested within groups:

http://master.bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#nested-indiv

The tricky part is just that you have two groupings of individuals: A/B and N/P. I would recommend to run the analysis once for A/B and once for N/P. Each time you should be able to follow the example in the vignette. Make sure that individual (1-6) is coded as a factor and not a numeric.

Also, when you run it for N/P, you will have to build your own model.matrix, and remove two interaction columns (because you have 4 vs 2 for N/P, whereas it is balanced for A/B).

Sincere thanks Michael, I hadn't considered controlling for the individual effects. As each sample here is one patient, I assumed that by grouping the individuals I could treat each time-point as a shift in expression on a group / population level by solely taking into account the within-group variance of gene expression. So with this approach I should run multiple DESeq analyses to answer:

a) "to assess whether either the group...

The effect of treatment (time) on group (B vs A)

design(dds1) <- formula(~ group + group:individual_g + group:time)
a1 <- results(dds1, name="groupA.timeafter") # A only
a2 <- results(dds1, name="groupB.timeafter") # B only
a3 <- results(dds1, contrast=list("groupB.timeafter", "groupA.timeafter")) # B vs A

b) ...or the response...

The effect of treatment (time) on response (P vs N)

design(dds2) <- formula(~ response + response:individual_r + response:time)
b1 <- results(dds2, name="responseN.timeafter") # N only
b2 <- results(dds2, name="responseP.timeafter") # P only
b3 <- results(dds2, contrast=list("responseP.timeafter", "responseN.timeafter")) # P vs N

c) ...or both have any effect on this change in gene expression after treatment"

How would I compare the two if possible? Could I paste the two factors response and group together (eg, response_group) and run a third similar design (reassigning individuals on a per-combination basis)?

design(dds3) <- formula(~ response_group + response_group:individual_gr + response_group:time)
results(dds3, contrast=list("response_groupP_A.timeafter", "response_groupP_B.timeafter"))


For example, to see if there is an additional group effect upon the after-treatment expression in responsive patients.

Finally, can the above DESeq analysis be used to compare pre-treatment gene expressions of B vs A or P vs N? Example:

results(dds1, name="groupB")

Or, should this be in contrast with the additional effect of time?

Thank you for the patience, it's very much appreciated.

Vince

1

hi Vince,

the contrasts in (a) and (b) are correct. Except for (b) you won't be able to use a formula for the design, you'll have to supply a model matrix in which you remove the columns which are all zeros, there is some example code in the vignette for this.

For (c), you don't really have enough samples, and A vs B and N vs P are too confounded to answer this question in my opinion. P = 3 B's and 1 A, while N = 1 B and 1 A. In general you don't want statistical inference to ever hinge on individual samples. That means the analysis is most likely underpowered to find differences which could be there. I would stick to the above results tables.

Thanks again Michael.

1
21 months ago by
Gavin Kelly560
United Kingdom / London / Francis Crick Institute
Gavin Kelly560 wrote:

The way I visualise this is to draw out a labelled cube, single letters for main effects, double letters for interactions (there's no group:response interaction in the model, so those are omitted):

Your first 'results' takes you upwards from the origin, so only strictly looks at the time effect for the A group non-responders.

Your second requirement - ie 'after treatment in P vs N samples, ignoring groupings': 'after treatment' means we're looking at the top surface of the cube; 'p vs N' means we're looking at the comparison going across the page (on the top surface); and "ignoring groups" means we need the average effect of both edges that are running across the page on the top of the surface.  Algebraically, that's the average of (t + r + tr) - (t) and (t+g+r+tg+tr)-(t+g+tg).  Both of these are r+tr, so the average (ignoring groups) is r+tr.  As these terms are both positive, they both go in the first slot of the contrast list - which is exactly what you've got.

Question three is again entirely focused on the upper surface of the cube (after treatment), and as I understand it is looking at genes whose response:non-response effect is different dependent on group. You can do the algebra of ((t+r+tr) - (t)) - ((t+g+r+tg+tr) - (t+g+tr)), or you can realise that you haven't included a response:group interaction in your model, and so by definition your contrast is constrained to be zero.

I'll leave question four for the time being, as it also seems to involve a response:group interaction - but I hope the above is enough to get you started.  If anyone has better mental models for how these things work, I'd love to hear them.  Another way of dealing with things such as the 'ignoring groups' requirement, is to omit the group variable from the model, which does away with the need to average across contrasts (which I think needs to be done with consideration for the number of samples in each sub-contrast).

Sorry to comment on my own post, but I was assuming that the 'before' and 'after' are done on different individuals: all the phrasings of contrasts in your original post, and my responses to them, ignore any pairing. Michael's response correctly (obviously!) identifies this as an issue.   Once that's taken into account, you'll still need to interpret the coefficients carefully, so this answer may help a little anyway?