Limma for differential expression in a time-course
1
0
Entering edit mode
Jake • 0
@jake-13924
Last seen 4.4 years ago

I have a question related to example 9.6.2 in the Limma User's Guide. The example describes how to find genes whose behavior over time is different between two conditions, where each condition has a decent number of time-points.

Reproduced here:

X <- splines::ns(targets$Time, df=5) Group <- factor(targets$Group)
design <- model.matrix(~Group*X)
fit <- lmFit(y, design)
fit <- eBayes(fit)
topTable(fit, coef=8:12)


Columns 8:12 of the design matrix correspond to the interaction terms between group and time.

What if I also want to find the genes whose average expression differs between conditions, adjusting for time? Is there a way to do this using the above design matrix? Currently I'm proceeding as follows:

X <- splines::ns(targets$Time, df=5) Group <- factor(targets$Group)
design <- model.matrix(~Group + X)
fit <- lmFit(y, design)
fit <- eBayes(fit)
topTable(fit, coef=2)


Column 2 of the design matrix corresponds to group. This method appears to give sensible results, but any advice would be appreciated. Thanks.

Jake

2
Entering edit mode
Aaron Lun ★ 27k
@alun
Last seen 7 hours ago
The city by the bay

The interaction model is not particularly helpful when testing for differences in average expression between groups. If you believe that a non-zero interaction can exist, then a difference in average expression becomes difficult to interpret; it could be driven by a shift in the "baseline" between groups with the same effect of time, or it could be driven by group-specific effects of time without any shift in the baselines. At this rate, you might as well test for any type of difference between groups, using an ANOVA-like comparison with coef=c(2,8:12).

The conventional way to handle this type of problem is to fit an interaction model, verify that the interaction term is not significant, and then fit an additive model to test the main effects. While you can do this for genome-wide analyses, it tends to be a bit of a pain as you have to choose a threshold to decide "non-significance" (which is the opposite of how testing procedures are designed to be used). I usually just try to answer my scientific question with a different statistical contrast that is more amenable, e.g., using an ANODEV to look for any difference between groups rather than focusing on differences in the averages.

I should also point out that, technically, your second approach will test for differences in expression between groups at the first time point. (Which is equivalent to a test for differences at any time point, as the additive model assumes that the effect of time is the same.) The concept of an "average expression" for a group becomes a bit woolly when you have blocking factors.

0
Entering edit mode

Thanks for the quick and detailed response! A few questions:

1. I still care about distinguishing (1) genes for which the difference in expression between conditions is non-zero and constant over time from (2) genes for which the difference in expression between conditions changes as a function of time. Would it be reasonable to compare, for each gene, the p-values that result from coef(2, 8:12) and from coef(8:12)?

2. In the spline scenario here, how does one extract the genes that show a change over time in one particular group? I've looked through the user's guide and support forum, but have found seemingly contradictory information.

3. I don't completely understand how, as you say, coef=2 will test for differences in expression between groups at the first time point. Let's say my experiment doesn't have discrete time-points, just values of time that are randomly distributed between 0 and 10 and are different between groups. What happens then?

0
Entering edit mode
1. Don't compare p-values between tests, that rarely makes any sense. If you want to detect genes for scenario 1, then you'll have to use an additive model (and make sure that you ignore genes with significant interaction effects). If you want scenario 2, then coef=8:12 in the interaction model is the way to go.
2. To test for any effect of time in the first group, you would specify coef=3:7. To do the same for the second group, you would need to set up an ANOVA to test whether coefficient 3 plus coefficient 8 is equal to zero and coefficients 4 plus 9 are equal to zero and coefficients 5 plus 10 are equal to zero, etc. via makeContrasts.
3. The first time point would be the earliest time point across all groups - even if that time point is not actually sampled in some groups! Yes, it seems funny, but it's true. Have a look at the values of your spline basis matrix and find the row at which the values of the matrix are all equal to zero. This is the point at which the time effect is "removed", so to speak (as the spline coefficients contribute nothing to the linear equation), which is when the other coefficients can be directly interpreted. You can imagine that the model "extrapolates" to the earliest time point if it doesn't exist in a group.
0
Entering edit mode

Sorry, I'm still confused. For detecting genes for scenario 1, I thought "model.matrix(~Group + X)" was an additive model, but you're saying coef=2 only looks at the first time-point. Do I need to use makeContrasts to look at the effect of group at each time-point? For randomly distributed time-points, it seems like this could get messy.

0
Entering edit mode

In the additive model, looking at a shift in expression between groups at the first time point is the same as looking at a shift at any time point, because the time effect is the same between groups. So if there's a log-fold change of 2 at the first time point, there will also be a log-fold change of 2 at any other time point. I mention the first time point as this is the literal interpretation of what the coefficient represents (and thus, what the test is actually doing).

0
Entering edit mode

But limma does use all the samples from all the time-points in order to make that estimate, right? As part of the "extrapolation" you mentioned? I'm just trying to wrap my head around this.

0
Entering edit mode

Estimate... of what? Of the second coefficient, i.e., the log-fold change between groups? If so, yes, the fit will use information across all samples at all time points to estimate the coefficient. However, the most informative samples will be those in the time intervals that have samples from both groups.

0
Entering edit mode

Yes, that's what I meant, thanks.