Hello everyone,
I am working on RNA-Seq data. I have 2 conditions (C = control, T = treatment), 4 time points (1, 5, 8 and 10) and 2 or 3 replicates (depending on the quality of the libraries) for each {condition, time point}
I am interested in the genes whose the expression patterns are different in the control and the treatment over time. It means that I want the genes whose the Fold change is different from 1 in at least one time point.
To this end, I build the model :
~ condi + time + time:condi
where condi
is my condition (C or T) and time
is the time point (1, 5, 8 and 10)
so that my design matrix is :
Intercept condiT time5 time8 time10 condiT.time5 condiT.time8 condiT.time10
1 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0
1 1 0 0 0 0 0 0
1 1 0 0 0 0 0 0
1 1 0 0 0 0 0 0
1 0 1 0 0 0 0 0
1 0 1 0 0 0 0 0
1 0 1 0 0 0 0 0
1 1 1 0 0 1 0 0
1 1 1 0 0 1 0 0
1 1 1 0 0 1 0 0
1 0 0 1 0 0 0 0
1 0 0 1 0 0 0 0
1 0 0 1 0 0 0 0
1 1 0 1 0 0 1 0
1 1 0 1 0 0 1 0
1 0 0 0 1 0 0 0
1 0 0 0 1 0 0 0
1 0 0 0 1 0 0 0
1 1 0 0 1 0 0 1
1 1 0 0 1 0 0 1
1 1 0 0 1 0 0 1
Note that at time point {1}, my control (C) and my treatments (T) are biologically identical (they are actual replicates).
It means that the differences between C and T at a given time point (5,8 or 10) is modeled by the the interaction term condiT.time(5, 8, or 10), as the condiT
term should be 0 for almost every gene. So, I don't need to build a contrast matrix.
For sanity checks, I look at the importance of the condiT
term, It gives me only one gene for which this condiT
term is significantly different from 0 (pval<0.05), and I am super happy with this.
code : topTable(efit, coef=c(2), n=Inf, p.value=0.05) # the coeff '2' is the condiT term
If I check the importance of the condi.Time
terms (which is what I am interested in), It gives me 16 genes.
code : topTable(efit, coef=c(6:8), n=Inf, p.value=0.05) # the coeff '(6:8)' are the interaction terms
As this is a very few genes, I tried to see what was going on, so I added the condiT
tem to the test :
code : topTable(efit, coef=c(2,6:8), n=Inf, p.value=0.05) # same line of code as the one above but with '(2,6:8)' instead of '(6:8)'
And this gives me 1531 genes for which this four terms are important (pval<0.05) to model the gene expression...
I just don't get how 16 genes become 1531 genes by adding a term that is not significant taken independently.
Would someone be able to help me ? I am very confused and I have no idea what to trust. Thanks a lot in advance.
Here is my code, if you want to have a look:
design <- model.matrix( ~ condi + time + time:condi , data=coldata)
colnames(design) <- str_replace_all(colnames(design),":",".") %>% str_replace_all(.,"[()]","") # remove parenthesis from Intercept and change ':' to '.'
idx <- filterByExpr(gse, design) # gse is my expression matrix
dds <- gse[idx,,keep.lib.size=FALSE]
dds <- calcNormFactors(dds)
v <- voom(dds, design)
fit <- lmFit(v, design)
efit <- eBayes(fit)
Best,
Arnaud Stigliani
University of Copenhagen
Dear James,
Thanks a lot for your answer.
Can you elaborate when you say 'you are testing for any gene that has a coefficient that is different between the three interaction terms and the treatment main effect' ?
I thought that I was testing this model :
Against this one
And I was surprised by the fact that this
condiT
influences the results as is seems to me that it is equal to 0 (Basically, I thought that it was useless to model the gene expression).Best,
Arnaud
Could you elaborate on why the interpretation of c(2,6:8) is not just if the coefficient is different from zero in any of the 4 terms?
Well, what I said isn't 100% correct. An F-test in conventional ANOVA does just what I said - testing to see if the mean of any groups is different. Not testing if the mean of any group is equal to zero.
However, in this context none of the coefficients you are estimating is the mean of a group, but instead is a contrast of some sort. So what limma does is use a function called
classifyTestsF
, which is intended to compute an F-statistic using all the underlying t-statistics for each coefficient, with the goal of detecting any situations where one or more of the underlying t-statistics are significant. This is still pretty much the same thing as a conventional F-test, because if any of the t-statistics are significant, then by definition there are differences in the means of each group.So why you get more significant genes using the F-test including the main effect is probably due to increased power due to the increased degrees of freedom? I don't know. In general I stay away from this sort of model formulation, because there's no profit in it. You really have 8 groups (four time points x two treatments), and in general I would just compute a cell means model by combining them
You can make any contrast you care to make, without having to deal with a treatment-contrasts design.
Hi James. Thanks for the explanation it really helped although I guess the increased power is still a mystery. I'm sorry I'm not familiar enough with contrast fits but would that approach also result in using a F-test or in 4 individual comparisons? I like the former approach as the result is a single p-value instead of multiple.
Why do you want a single p-value? The goal is usually to find genes that are varying at a particular time point. I guess you could detect genes that vary at any of the time points, but then the next question is 'but which one?' at which point you have to do the individual comparisons anyway, so why not start there? Anyway, it's simple
Now the first three coefficients are your interaction terms, and the last one is the main effect, testing if the average T is different from the average N. If you want the F-test, you can still do that. But you can also see that the F-test using all four coefficients is sort of weird, because three are doing similar tests and the last one isn't. I'm not smart enough to know if that is a problem or not, but it looks weird to me so I probably wouldn't do it to begin with.
Dear James,
I thank you very much for your explanations and your detailed answers. I think that you helped me a lot ! I will try to go for a contrast matrix, which seems more straightforward for my analysis.
Best, Arnaud
Still, I am wondering something about this contrast :
Does it mean that I am comparing the mean of all the genes expressed in T vs N ? Or does it mean that I am testing at the same time 4 pair-wise comparisons ?
Thank you !
I think your definitions of the comparison are really the same thing, algebraically. The way I defined the contrast, it's the average of the T samples as compared to the N samples. But it's a simple agebraic manipulation to change to
In which case it's the average of the within timepoint differences. But it's the same thing! We have just used words to describe it in a slightly different way, but algebraically they are identical.
Thanks again for your answers that were super helpful !!
Best, Arnaud