Question

On the relationship between design and contrasts

0

Entering edit mode

kjo ▴ 70

@kjo-11078

Last seen 7.3 years ago

This is a radically simplified version of: Confused by "composite" designs. (After I read the comments and answers that earlier post received, I realized that my query had conflated several different questions, which made it pretty confusing. In this post I attempt to narrow the focus to only one question, using a simpler, even if less realistic, example. The phrasing of this second attempt is sufficiently different from that of the first one that I thought it would be too confusing to just edit or update the original post in-place. Hence this second post.)

This is a hypothetical example. I have 10 samples, as described in the metadata table below:

    > metadata
       drug dose
    1  NONE  NaN
    2     A    1
    3     A   10
    4     A  100
    5     B    1
    6     B   10
    7     B  100
    8     C    1
    9     C   10
    10    C  100

Now, I create two deseq datasets, with different designs:

    dds1 <- DESeq(DESeqDataSetFromMatrix(countData = counts,
                                            colData = metadata,
                                            design = ~ drug))

    dds2 <- DESeq(DESeqDataSetFromMatrix(countData = counts,
                                            colData = metadata,
                                            design = ~ drug + dose))

Finally, I compute two sets of results, as follows:

    results1 <- list(A = results(dds1, contrast = c("drugA", "drugNONE")),
                     B = results(dds1, contrast = c("drugB", "drugNONE")),
                     C = results(dds1, contrast = c("drugC", "drugNONE")))

    results2 <- list(A = results(dds2, contrast = c("drugA", "drugNONE")),
                     B = results(dds2, contrast = c("drugB", "drugNONE")),
                     C = results(dds2, contrast = c("drugC", "drugNONE")))

The two RHS expressions above are identical, except that one uses dds1 (hence, design = ~ drug) and the other on uses dds2 (hence, design = ~ drug + dose).

How do results1 and results2 differ? I don't mean, how do their values differ, but rather how do they differ semantically?

What I'm trying to get at is whether the interpretation of the expression contrast = c("drugA", "drugNONE") is context-dependent or not, and if it is, how.

Here, the only context differences I'm interested in are those resulting from differences in the formulae used as values for the design parameter.

(Of course, in some trivial ways, the meaning of the expression contrast = c("drugA", "drugNONE") is very much context dependent. For example, if

    dds0 <- DESeq(DESeqDataSetFromMatrix(countData = counts,
                                            colData = metadata,
                                            design = ~ dose))

...then the expression results(dds0, contrast = c("drugA", "drugNONE")) is downright invalid, so in this case it may be fair to say that the expression contrast = c("drugA", "drugNONE") is basically meaningless.)

As a further wrinkle, suppose that I now define

    dds3 <- DESeq(ESeqDataSetFromMatrix(countData = counts,
                                           colData = metadata,
                                           design = ~ dose + drug))

    results3 <- list(A = results(dds3, contrast = c("drugA", "drugNONE")),
                     B = results(dds3, contrast = c("drugB", "drugNONE")),
                     C = results(dds3, contrast = c("drugC", "drugNONE")))

Now, the only difference between dds2 and dds3 is in the ordering of the factors in the design formulas. Is there any semantic difference between results2 and results3?

deseq2 • 791 views

ADD COMMENT • link updated 7.6 years ago by Ryan C. Thompson ★ 7.9k • written 7.6 years ago by kjo ▴ 70

score 3 · Accepted Answer · 2016-09-15

Yes, the meaning of a contrast depends on the design. In results1, since dose was not included in the design, you are essentially treating your samples as 4 groups. All the drug A samples are considered equivalent to each other regardless of dose, as are all the drug B and drug C samples. This design would be useful for a drug effect that was not strongly dose-dependent, which I'm assuming is not the case here, so this design is probably not what you want for this analysis.

In results2, you are adding a dose effect that is independent of the drug. This means that each dose is assumed to have the same effect regardless of which drug is being added, which, again, is probably not what you want. The results from this design would represent the differences between each drug and the untreated sample after subtracting out the average effect of all 3 drugs at each dose. For example, the contrast for drug A would find genes that were consistently different in A compared to the average of all 3 drugs at each dose level. This is generally not a biologically meaningful contrast.

In your original question, Michael Love recommended generating the interaction of the two, something like:

metadata$condition <- interaction(metadata$drug, metadata$dose, drop=TRUE, sep="")

And then using ~condition as your design. This won't work for the example you give here, since this design yields 10 groups with 1 sample in each group, which is a design with no replicates. But if you have more samples at each combination of drug and dose, as in your original example, this will work just fine. If those additional samples are grouped into batches, then you should also add the batch effect to the design, e.g. ~condition+batch. This will subtract out differences between the average of each batch before performing contrasts between conditions.

Note: If your dose column is a numeric vector, then including it in your design will attempt to relate the dose and the gene's log counts in a strictly linear fashion. This constraint is usually too restrictive for most designs, especially when the doses range over multiple orders of magnitude, as is the case in your experiment. Usually, you should convert any numeric variables to factors, unless you specifically want a linear relationship between the numeric value and log count. Treating the dose as a factor allows each dose to have a different effect. For example, in drug A, a particular gene could go up at a dose of 1, up again at a dose of 10, but then come back down at a dose of 100.