I have some RNA-seq datasets from four genotypes (A, B, C, D) and two conditions (time1 and time2). Each genotype in each condition has three replicates. I would like to compare whether log2(A/B) equals log2(C/D) or not. Does someone have suggestions on how to implement this test in DESeq2? And how to set a contrast.
Thanks for a reply. I think it's possible to split the conditions (different time point) into two separate datasets and let one dataset has only one factor (genotype) ; I read the DESeq2 manual, but it seems most examples are dealing with lineal model, like comparing (A+B) to (C+D). But I wonder if it is possible to compare log2(A/B) to log2(C/D) and get adjusted p values (test equality of the ratios at each timepoint)?
So, the easiest way to test, e.g. B/A = D/C is using interaction terms. First you need to code some new variables. Here I create a little toy dataset which doesn't have the 3 replicates per genotype x condition, but you'll see how to do that same to your colData. Create two variables which each divide A,B,C,D into two groups. Maybe you can give some more meaningful names here than geno2 and geno3 (for example, some description of whatever A and C have in common, and the same for whatever A and B have in common).
> d <- expand.grid(geno=letters[1:4], time=c("1","2"))
> d
geno time
1 a 1
2 b 1
3 c 1
4 d 1
5 a 2
6 b 2
7 c 2
8 d 2
> d$geno2 <- ifelse(d$geno %in% c("a","c"), "1", "2")
> d
geno time geno2
1 a 1 1
2 b 1 2
3 c 1 1
4 d 1 2
5 a 2 1
6 b 2 2
7 c 2 1
8 d 2 2
> d$geno3 <- ifelse(d$geno %in% c("a","b"), "1", "2")
> d
geno time geno2 geno3
1 a 1 1 1
2 b 1 2 1
3 c 1 1 2
4 d 1 2 2
5 a 2 1 1
6 b 2 2 1
7 c 2 1 2
8 d 2 2 2
Now if you didn't have two time points, you would just use a design ~geno2 + geno3 + geno2:geno3 and run DESeq() and then test the interaction term using something like results(dds, name="geno22.geno32") ... you'll have to look at resultsNames(dds) to see what the name of the interaction term is for your dataset. The interaction term will have a low p-value if there is a small probability of seeing such an large difference in observed fold changes under the null hypothesis: B/A = D/C.
However, you can simply add time to the design in the following way to create more model terms to account for the two time points, including two interaction terms:
~ time + time:geno2 + time:geno3 + time:geno2:geno3
Now you can test name="time1.geno22.geno32" for B/A = D/C at time 1, and likewise for time 2. See the names in resultsNames(dds).
Note: I'm assuming that all of these terms are factors. You can make sure with: class(dds$time), etc.
How do you interpret the log2foldchanges in the results table for testing the interaction term? It seem like the larger the difference between fold changes, the larger the log2foldchange is in the results table. I'm not sure about the sign of the log2foldchange.
Take a look at the diagram of interactions in the vignette. The interaction is the difference in the effect size between groups, where it compares to the reference level (e.g. Control group or WT or whatever you specified). So + means "bigger effect than in reference group", and - means "smaller effect", and 0 means the effect is the same size as in the reference group.
Note that by bigger and smaller I refer to the signed value, not the absolute value. So if the main effect is negative, and the interaction term is negative, this means the effect is larger in absolute value (a bigger negative number) relative to the reference group.
Hi Mike, thank you very much for your detail suggestion. I have tested running this design with my data. Just have a couple of more questions to be classified: 1) the interaction term geno2:geno3 is to test if the effect of geno2 is equal or different in geno3 1 vs geno3 2? if equal, "geno22.geno32" is non-significant and no geno2:geno3 effect in defined model? 2) the results (adj p values) of "geno22.geno32" is same to "geno21.geno31 (fold changes is minus), so not shown in resultsNames()? 3) for replicates (e.g. b1, b2, b3), I can simply use same factor numbers in geno2 (as "2") or geno3 (as "1"), or I need a new variable?
1) yes the interaction tests if the ratio is equal. note that testing
B/A = D/C
is identical to testing
B/D = A/C
Non-significance does not imply equality, for example, you could have an underpowered experiment. This is a general rule in frequentist statistics: p-value > alpha does not necessarily mean you have evidence that the null is true.
2) There is a single interaction term for the test of the ratios being equal. You can control which levels are in the denominator of the fold change by setting the reference level of the factors using relevel() (see vignette).
3) My code above for a toy dataset does not have replicates, but these would just be repeated rows of exactly what I have there. You would not define new variables.
Thanks very much for the classification. I am bad in this statistical thoughts. So p-value<alpha we can reject the null hypothesis of the quality, but p-value>alpha we can NOT accept the hypothesis. Does this mean we need another test and set the null hypothesis to be B/A ≠ D/C and see if this to be rejected or not? and how to design this in DESeq2?
Yes in fact you can use another test for small fold changes, by setting a different null hypothesis.
For background, you should definitely read the DESeq2 paper, the section on hypothesis tests with thresholds and in particular "Specifying maximum effect size".
Then you can use lfcThresold and altHypothesis="lessAbs". (See the vignette section on threshold tests or ?results for details on how to do this using DESeq2.)
Great, I will try disabling the LFC shrinkage, using the same design but altHypothesis. I think lfcThresold is same with the default (0) or weak (e.g. 0.05) for testing the equality. Many thanks.
hi Zhu,
You don't need to split into separate datasets.
So, the easiest way to test, e.g. B/A = D/C is using interaction terms. First you need to code some new variables. Here I create a little toy dataset which doesn't have the 3 replicates per genotype x condition, but you'll see how to do that same to your colData. Create two variables which each divide A,B,C,D into two groups. Maybe you can give some more meaningful names here than geno2 and geno3 (for example, some description of whatever A and C have in common, and the same for whatever A and B have in common).
Now if you didn't have two time points, you would just use a design ~geno2 + geno3 + geno2:geno3 and run DESeq() and then test the interaction term using something like results(dds, name="geno22.geno32") ... you'll have to look at resultsNames(dds) to see what the name of the interaction term is for your dataset. The interaction term will have a low p-value if there is a small probability of seeing such an large difference in observed fold changes under the null hypothesis: B/A = D/C.
However, you can simply add time to the design in the following way to create more model terms to account for the two time points, including two interaction terms:
Now you can test name="time1.geno22.geno32" for B/A = D/C at time 1, and likewise for time 2. See the names in resultsNames(dds).
Note: I'm assuming that all of these terms are factors. You can make sure with: class(dds$time), etc.
Hi Michael,
How do you interpret the log2foldchanges in the results table for testing the interaction term? It seem like the larger the difference between fold changes, the larger the log2foldchange is in the results table. I'm not sure about the sign of the log2foldchange.
Take a look at the diagram of interactions in the vignette. The interaction is the difference in the effect size between groups, where it compares to the reference level (e.g. Control group or WT or whatever you specified). So + means "bigger effect than in reference group", and - means "smaller effect", and 0 means the effect is the same size as in the reference group.
Note that by bigger and smaller I refer to the signed value, not the absolute value. So if the main effect is negative, and the interaction term is negative, this means the effect is larger in absolute value (a bigger negative number) relative to the reference group.