Question: LIMMA time series spline model.matrix design setup
0
8 months ago by
tleona30
tleona30 wrote:

I have reviewed the LIMMA manual (esp. sections 9.6.1 and 9.6.2), and need conceptual help in applying either this approach to my experimental design or suggestions to use an alternative approach.

My microarray RNA data is as follows: Two tissue types, five time points (0, 6, 24, 72, 168 hours), ~20 subjects, with each subject having up to 3 samples collected for each tissue at various time points (IRB restrictions). A wound is made and the time zero collection is the "baseline" expression profile for each tissue.

I wanted to ask the following questions but can not wrap my head around the best approach in terms of setting up the design matrix. Questions:

1) How do the tissues independently respond to wounding over time?

I could compare each tissue after wounding to it's baseline time 0hr doing various contrasts (M6-M0, M24-M0, M72-M0, etc), but the issue I find here is that the two tissues are completely different at time 0hr. This means my "control" or "baseline" that I'm comparing subsequent time points to is different between the two tissues and that I cannot really compare the results (like an intersect or overlap) between the two tissues. Would it make sense to group M0 and C0 as a control group?

Another approach is to compare changes over time using these contrasts (M6-M0, M24-M6, M72-M24, etc.) for each tissue. This leads to four comparisons for each tissue (8 total).

2) How do the two tissues differ in their response to wounding over time?

For question two I thought I could compare each tissue at each time point using contrasts (M0-C0, M6-C6, M24-C24, etc), but this is also a lot of comparisons (4) so I thought a spline based approach to model general changes over time would be more tractable. The positive aspect of the spline is I get changes in expression over time that follow different trends between tissues in one comparison. The downside is that the only way (I can think of) to group genes changing in the same manner over time is to cluster the significant results after adj.P.Val and lfc cutoff. Please see below:

    pData$X <- ns(pData$time, df=3)

#Setup model matrix and rename columns ":" to "."

design <- model.matrix(~0+tissue+tissue:X, eset)

colnames(design)
[1] "tissueC"    "tissueM"    "tissueC.X1" "tissueM.X1" "tissueC.X2" "tissueM.X2"
[7] "tissueC.X3" "tissueM.X3"

# Estimate the correlation between measurements made on the same subject:

corfit <- duplicateCorrelation(eset,design,block=eset$subject) corfit$consensus

# Fit the model
fit <- lmFit(eset,design,block=pData$subject,correlation=corfit$consensus)
fit2 <- eBayes(fit)
table1 <- topTable(fit2, coef=c(3:8), number=Inf, adjust.method="BH", sort.by="none")


3) I'm wondering how different is the table1 output (conceptually) than the table2 output from the code below in terms of what it is reporting?

cont.mat = makeContrasts(tissueM.X3-tissueC.X3, tissueM.X2-tissueC.X2, tissueM.X1-tissueC.X1, levels = design)
fit.cont =contrasts.fit(fit, cont.mat)
fit.cont=eBayes(fit.cont)
table2 <- topTable(fit.cont, number = Inf, adjust.method = "BH", sort.by = "none")


4) In order to look for genes changing over time in tissue M is the code below correct?

tableM <- topTable(fit2, coef=c(4, 6, 8), number = Inf, adjust.method = "BH", sort.by = "none")


5) In order to look for genes changing over time in tissue C is the code below correct?

tableC <- topTable(fit2, coef=c(3, 5, 7), number = Inf, adjust.method = "BH", sort.by = "none")

modified 8 months ago by Aaron Lun25k • written 8 months ago by tleona30
Answer: LIMMA time series spline model.matrix design setup
3
8 months ago by
Aaron Lun25k
Cambridge, United Kingdom
Aaron Lun25k wrote:

It helps if you set up something that we can play with.

# For simplicity, we'll just use 2 replicates
# per tissue and time-point combination.
tissue <- rep(c("M", "C"), each=10)
time <- rep(rep(c(0, 6, 24, 72, 168), each=2), 2)

# Assume one subject contributes to all time points
# for simplicity. I know this isn't actually the case,
# but this is fine provided that most subjects contribute
# to at least two samples.
subject <- c(rep(1:2, 5), rep(3:4, 5))


The obvious design matrix is:

g <- factor(paste0(tissue, time))
design <- model.matrix(~0 + g)
colnames(design) <- levels(g)


... with blocking on subject in duplicateCorrelation().

1) How do the tissues independently respond to wounding over time?

... I cannot really compare the results (like an intersect or overlap) between the two tissues.

Why not? The result of, e.g., M24-M0 or C24-C0 represents the tissue-specific wounding response at each time point. It doesn't matter that the baseline (time zero) is different; you're interested in the response! Baseline differences are not a problem, and in fact are the entire reason for collecting tissue-specific time zero samples in the first place.

An intersection is sensible if you want to find genes responding in both tissues. If you want to find differences, see 2.

Would it make sense to group M0 and C0 as a control group?

No.

Another approach is to compare changes over time using these contrasts (M6-M0, M24-M6, M72-M24, etc.) for each tissue.

I don't understand why you think this will solve your "baseline differences problem" (which, to be clear, is not a problem - see above). You're still doing M6-M0, for example.

This approach can be sensible if you specifically want to look for changes in consecutive time points, rather than comparing everything to the zero hour. The real motivation for choosing between these two approaches is not statistical (it's just some arithmetic to the linear model) but what you find more scientifically interesting/easier to interpret.

2) How do the two tissues differ in their response to wounding over time?

You could use a spline, but since we've set up our model already, we might as well stick with it. Let's start with the 6 hour.

con.6 <- makeContrasts((M6 - M0) - (C6 - C0), levels=design)


This tests the null hypothesis that the 6 hour response is different between tissues. You can repeat this process with each of the non-zero time points, giving you four comparisons. This is much better than your M0 - C0 approach which fails to account for differences in the baseline between tissues, i.e., you're not really testing for differences in the response.

The harder question is how to combine the four comparisons into a single "conclusion". If you just want to obtain a single DE list, you can do an ANOVA-like contrast:

anova.con <- makeContrasts((M6 - M0) - (C6 - C0),
(M24 - M0) - (C24 - C0),
(M72 - M0) - (C72 - C0),
(M168 - M0) - (C168 - C0), levels=design)


... and drop all coefficients in topTable(). This will identify genes with any difference in the responses at any time point.

If you want to classify genes into groups, you could do so via decideTests() with method="nestedF"; this will give you a matrix on which to categorize genes based on the presence (and direction) of a response at each time point. Note that this refers to responses relative to baseline, which allows you to create groups based on questions like "has this gene returned to time zero expression". If you test responses relative to the previous time point (see 1), you can create groups based on questions like "is this gene's expression going up or down compared to the last time point".

However, if you also want to explicitly consider the magnitude of the change, or if the possibility of obtaining 3^n groups (for n comparisons in anova.con) is too much to annotate and interpret, then there's little else to do but perform clustering on the log-fold changes reported by testing anova.con.

3) I'm wondering how different is the table1 output (conceptually) than the table2 output from the code below in terms of what it is reporting?

From reading your colnames(design), it seems that the first two terms represent the tissue-specific intercepts, and the remaining terms represent the tissue-specific spline terms. In that case:

• table1 tests the null hypothesis that all spline:tissue interaction terms are equal to zero, i.e., there is no response to time in either tissue.
• table2 tests the null hypothesis that corresponding spline:tissue terms are equal between tissues, i.e., any trend is the same between tissues.

4) In order to look for genes changing over time in tissue M is the code below correct?

Yes, if you were to use splines. Same for 5.

Hi Aaron, this is all great information and I really appreciate you taking the time to help! This particular step was a current bottleneck in my project as I wanted to get it right and I'm excited to move forward with it.

Just to clarify,

con.6 <- makeContrasts((M6 - M0) - (C6 - C0), levels=design)


This tests the null hypothesis that the 6 hour response is the same (not different) between tissues. The alternative hypothesis is that they are not the same, correct?

1

That is correct. The other time points follow the same principle.

For the results output from the following contrast listed (also above):

con.6 <- makeContrasts((M6 - M0) - (C6 - C0), levels=design)
res <- topTable(both.fit2, coef = "con.6", number = Inf, adjust.method = "BH")


In this results file, do the positive test statistics correspond to genes over expressed in M at 6hr (after baseline differences are accounted for) and negative test statistics represent C at 6hr or do I have this backwards? I just wanted to ensure I have it correct.

Hope its okay to post this here rather than creating a new post.

"This approach can be sensible if you specifically want to look for changes in consecutive time points, rather than comparing everything to the zero hour."

When using this model (time1-time0, time2-time1, time3-time2... etc) limma seems to give me genes differentially expressed at any time point. Is this the correct way to identify genes which, for example, are sequentially upregulated more and more with time?