[Limma] Testing continuous variable with interaction
Entering edit mode
lanhuong ▴ 20
Last seen 4.1 years ago


I have a diet study experiment, where I would like to test whether a continuous variable, cont.var,  is related with gene expression data. 

1) I would like to find the effect of cont.var  within each category (GroupA/GroupB), and within each time-point (T0/T1). The gene expression data was collected at time T0  and then at time T1. I would like to test the effect of the continuous variable in each of the 4 groups (GroupA_T0, GroupA_T1, GroupB_T0, GroupB_T1).

2) Additionally, I have uneven number of observations for each subject. Each subject is either in GroupA or GroupB. So we have a confounding group and subject effect. Multiple but not equal number of samples are collected for each subjects at two time points, the number of observations per subject at a specific time point varies between 0 and 3. There are 70 subjects and 326 samples in total.

3) The might be a batch effect. And the information about sample batch (B1-B5) membership is known.

As I am interested in the effect of cont.var within each group and at each time point I started modeling the data as follows (rawcount contains the raw expression data):

design <- model.matrix(~ 0 + Lane + Group + Timepoint +  Timepoint:Group:cont.var, metadata)
v <- voom(rawcount, design, lib.size = libSize, plot = FALSE)
# Account for subject (individual) effect:
corfit <- duplicateCorrelation(v, design, block=metadata$SubjectID)
v <- voom(rawcount, design, lib.size = libSize, block = metadata$SubjectID,
               correlation = corfit$consensus, plot = plot)
fit <- lmFit(v, design, block = metadata$SubjectID,
               correlation = corfit$consensus)
fit <- lmFit(v, design)
fit <- contrasts.fit(fit, contrast)
fit <- eBayes(fit)


My question is whether my deisgn matrix was declared appropriately, and what should be the contrast parameter for contrasts.fit()?

I am interested in both:

a) whether there is an effect of the continuous  variable for a particular gene at all within each group and each timepoint. 

b) whether the effect of the continuous variable is different between 2 groups at each time point or between time points for each group individually.

I would really appreciate some help!








limma voom continuous interactions confounding • 1.4k views
Entering edit mode
Last seen 13 months ago
Scripps Research, La Jolla, CA

You were close but not quite right. You forgot the interaction term between Group and Timepoint. You didn't give the full metadata table, so I'll just make up my own example data:

metadata <- data.frame(
    Lane=sample(LETTERS[1:5], 326, replace=TRUE),
    SubjectID=sample(factor(1:70), 326, replace=TRUE),
    Timepoint=sample(c("0", "1"), 326, replace=TRUE),
metadata$Group <- factor(ifelse(metadata$SubjectID %in% 1:35, "A", "B"))
design <- model.matrix(~1 + Lane + Group + Timepoint + Timepoint:Group + Timepoint:Group:cont.var, metadata)

Now let's look at the column names of the resulting design matrix:

> colnames(design)
 [1] "(Intercept)"                "LaneB"                     
 [3] "LaneC"                      "LaneD"                     
 [5] "LaneE"                      "GroupB"                    
 [7] "Timepoint1"                 "GroupB:Timepoint1"         
 [9] "GroupA:Timepoint0:cont.var" "GroupB:Timepoint0:cont.var"
[11] "GroupA:Timepoint1:cont.var" "GroupB:Timepoint1:cont.var"

The coefficients that you're looking for are coefficients 9 through 12, the ones that look like "GroupX:TimepointY:cont.var". Each of these 4 coefficients represents the effect of cont.var in the specified group. You are effectively fitting 4 lines to the data, one for each group, with each line having its own slope and intercept term. These last 4 coefficients correspond to the slopes of those 4 lines, while the intercepts of each line are represented indirectly by the "(Intercept)" term and coefficients 6, 7, and 8. Notice that with your design, you will be lacking coefficent 8, "GroupB:Timepoint1", which means you are fitting 4 slopes but only 3 intercepts, which is likely not what you wanted.

In any case, you can test for a nonzero effect of the continuous variable in each group by testing each of the coefficients 9 through 12, and you can test for a differential effect between groups by forming contrasts between these coefficients using makeContrasts. You should replace the colons in the coefficient names with periods before running makeContrasts using the following code:

colnames(design) <- make.names(colnames(design))

On a side note, I'm confused as to how you can have multiple samples from the same subject at the some time point.

Entering edit mode
lanhuong ▴ 20
Last seen 4.1 years ago

Thank you very much for a thorough response! Things are much clearer now.

The reason I have multiple samples from the same subject per time point is that they are biological replicates taken at different days. So there are actually more than 2 time points. That is, samples are taken at a baseline time (T0) in 3 consecutive days, and then 3 weeks later (T1) also repeated in 3 days. Some of the samples were missing or people did not show up, hence different number of samples per subject.

One further question is that if I would like to fit a non linear trend to the data (splines) e.g. 

X <- ns(metadata$cont.var, df=5) and then:
design <- model.matrix(~1 + Lane + Group + Timepoint + Timepoint:Group + Timepoint:Group:X, metadata)

What p-values should I then look at to decide which genes are differently expressed based on the cont.var effect?

I would have 5 different coefficients for the continuous variable for each group/time point category is that right? Another question is whether the number of samples I have is sufficient for fitting df=5 splines?

I really appreciate your help!

Entering edit mode

It should be a pretty straightforward extension. After running your code:

X <- ns(metadata$cont.var, 5)
design <- model.matrix(~1 + Lane + Group + Timepoint + Timepoint:Group + Timepoint:Group:X, metadata)

Look at the column names:

> colnames(design)
 [1] "(Intercept)"          "LaneB"                "LaneC"               
 [4] "LaneD"                "LaneE"                "GroupB"              
 [7] "Timepoint1"           "GroupB:Timepoint1"    "GroupA:Timepoint0:X1"
[10] "GroupB:Timepoint0:X1" "GroupA:Timepoint1:X1" "GroupB:Timepoint1:X1"
[13] "GroupA:Timepoint0:X2" "GroupB:Timepoint0:X2" "GroupA:Timepoint1:X2"
[16] "GroupB:Timepoint1:X2" "GroupA:Timepoint0:X3" "GroupB:Timepoint0:X3"
[19] "GroupA:Timepoint1:X3" "GroupB:Timepoint1:X3" "GroupA:Timepoint0:X4"
[22] "GroupB:Timepoint0:X4" "GroupA:Timepoint1:X4" "GroupB:Timepoint1:X4"
[25] "GroupA:Timepoint0:X5" "GroupB:Timepoint0:X5" "GroupA:Timepoint1:X5"
[28] "GroupB:Timepoint1:X5"

The first 8 coefficients are the same: the lane, group, and timepoint effects. Then, for each combination of group and time point, you now have 5 coefficients, X1 through X5. To test the continuous variable in each group/time, just test the five coefficients for that combination. Pass them as a vector, e.g. c(9, 13, 17, 21, 25) for Group A at time 0. This will perform an ANOVA of all 5 coefficients, giving you a single p-value for overall evidence of nonlinear response to the continuous variable. To test for differences in the nonlinear response to cont.var between two groups, you will have to form the 5 contrasts between corresponding spline coefficients for the two groups and perform an ANOVA on these 5 coefficients.

As to whether you have enough data to fit splines with 5 df to your data, I honestly have no idea. I work primarily in RNA-seq and other similarly expensive data, so I've never had a data set large enough to make spline fits practical.

Note: Please use the "Add comment" button to reply to someone else's answer, instead of the "add your answer" box.


Login before adding your answer.

Traffic: 251 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6