[Limma] Testing continuous variable with interaction
2
1
Entering edit mode
lanhuong ▴ 20
@lanhuong-9878
Last seen 4.1 years ago

Hi,

I have a diet study experiment, where I would like to test whether a continuous variable, cont.var,  is related with gene expression data.

1) I would like to find the effect of cont.var  within each category (GroupA/GroupB), and within each time-point (T0/T1). The gene expression data was collected at time T0  and then at time T1. I would like to test the effect of the continuous variable in each of the 4 groups (GroupA_T0, GroupA_T1, GroupB_T0, GroupB_T1).

2) Additionally, I have uneven number of observations for each subject. Each subject is either in GroupA or GroupB. So we have a confounding group and subject effect. Multiple but not equal number of samples are collected for each subjects at two time points, the number of observations per subject at a specific time point varies between 0 and 3. There are 70 subjects and 326 samples in total.

3) The might be a batch effect. And the information about sample batch (B1-B5) membership is known.

As I am interested in the effect of cont.var within each group and at each time point I started modeling the data as follows (rawcount contains the raw expression data):

design <- model.matrix(~ 0 + Lane + Group + Timepoint +  Timepoint:Group:cont.var, metadata)
v <- voom(rawcount, design, lib.size = libSize, plot = FALSE)
# Account for subject (individual) effect:
corfit <- duplicateCorrelation(v, design, block=metadata$SubjectID) v <- voom(rawcount, design, lib.size = libSize, block = metadata$SubjectID,
correlation = corfit$consensus, plot = plot) fit <- lmFit(v, design, block = metadata$SubjectID,
correlation = corfit$consensus) fit <- lmFit(v, design) fit <- contrasts.fit(fit, contrast) fit <- eBayes(fit) My question is whether my deisgn matrix was declared appropriately, and what should be the contrast parameter for contrasts.fit()? I am interested in both: a) whether there is an effect of the continuous variable for a particular gene at all within each group and each timepoint. b) whether the effect of the continuous variable is different between 2 groups at each time point or between time points for each group individually. I would really appreciate some help! Thanks, Lan limma voom continuous interactions confounding • 1.4k views ADD COMMENT 3 Entering edit mode @ryan-c-thompson-5618 Last seen 13 months ago Scripps Research, La Jolla, CA You were close but not quite right. You forgot the interaction term between Group and Timepoint. You didn't give the full metadata table, so I'll just make up my own example data: metadata <- data.frame( Lane=sample(LETTERS[1:5], 326, replace=TRUE), SubjectID=sample(factor(1:70), 326, replace=TRUE), Timepoint=sample(c("0", "1"), 326, replace=TRUE), cont.var=rnorm(326)) metadata$Group <- factor(ifelse(metadata$SubjectID %in% 1:35, "A", "B")) design <- model.matrix(~1 + Lane + Group + Timepoint + Timepoint:Group + Timepoint:Group:cont.var, metadata)  Now let's look at the column names of the resulting design matrix: > colnames(design) [1] "(Intercept)" "LaneB" [3] "LaneC" "LaneD" [5] "LaneE" "GroupB" [7] "Timepoint1" "GroupB:Timepoint1" [9] "GroupA:Timepoint0:cont.var" "GroupB:Timepoint0:cont.var" [11] "GroupA:Timepoint1:cont.var" "GroupB:Timepoint1:cont.var" The coefficients that you're looking for are coefficients 9 through 12, the ones that look like "GroupX:TimepointY:cont.var". Each of these 4 coefficients represents the effect of cont.var in the specified group. You are effectively fitting 4 lines to the data, one for each group, with each line having its own slope and intercept term. These last 4 coefficients correspond to the slopes of those 4 lines, while the intercepts of each line are represented indirectly by the "(Intercept)" term and coefficients 6, 7, and 8. Notice that with your design, you will be lacking coefficent 8, "GroupB:Timepoint1", which means you are fitting 4 slopes but only 3 intercepts, which is likely not what you wanted. In any case, you can test for a nonzero effect of the continuous variable in each group by testing each of the coefficients 9 through 12, and you can test for a differential effect between groups by forming contrasts between these coefficients using makeContrasts. You should replace the colons in the coefficient names with periods before running makeContrasts using the following code: colnames(design) <- make.names(colnames(design)) On a side note, I'm confused as to how you can have multiple samples from the same subject at the some time point. ADD COMMENT 1 Entering edit mode lanhuong ▴ 20 @lanhuong-9878 Last seen 4.1 years ago Thank you very much for a thorough response! Things are much clearer now. The reason I have multiple samples from the same subject per time point is that they are biological replicates taken at different days. So there are actually more than 2 time points. That is, samples are taken at a baseline time (T0) in 3 consecutive days, and then 3 weeks later (T1) also repeated in 3 days. Some of the samples were missing or people did not show up, hence different number of samples per subject. One further question is that if I would like to fit a non linear trend to the data (splines) e.g. X <- ns(metadata$cont.var, df=5) and then:
design <- model.matrix(~1 + Lane + Group + Timepoint + Timepoint:Group + Timepoint:Group:X, metadata)

What p-values should I then look at to decide which genes are differently expressed based on the cont.var effect?

I would have 5 different coefficients for the continuous variable for each group/time point category is that right? Another question is whether the number of samples I have is sufficient for fitting df=5 splines?

1
Entering edit mode

It should be a pretty straightforward extension. After running your code:

X <- ns(metadata\$cont.var, 5)
design <- model.matrix(~1 + Lane + Group + Timepoint + Timepoint:Group + Timepoint:Group:X, metadata)

Look at the column names:

> colnames(design)
[1] "(Intercept)"          "LaneB"                "LaneC"
[4] "LaneD"                "LaneE"                "GroupB"
[7] "Timepoint1"           "GroupB:Timepoint1"    "GroupA:Timepoint0:X1"
[10] "GroupB:Timepoint0:X1" "GroupA:Timepoint1:X1" "GroupB:Timepoint1:X1"
[13] "GroupA:Timepoint0:X2" "GroupB:Timepoint0:X2" "GroupA:Timepoint1:X2"
[16] "GroupB:Timepoint1:X2" "GroupA:Timepoint0:X3" "GroupB:Timepoint0:X3"
[19] "GroupA:Timepoint1:X3" "GroupB:Timepoint1:X3" "GroupA:Timepoint0:X4"
[22] "GroupB:Timepoint0:X4" "GroupA:Timepoint1:X4" "GroupB:Timepoint1:X4"
[25] "GroupA:Timepoint0:X5" "GroupB:Timepoint0:X5" "GroupA:Timepoint1:X5"
[28] "GroupB:Timepoint1:X5"

The first 8 coefficients are the same: the lane, group, and timepoint effects. Then, for each combination of group and time point, you now have 5 coefficients, X1 through X5. To test the continuous variable in each group/time, just test the five coefficients for that combination. Pass them as a vector, e.g. c(9, 13, 17, 21, 25) for Group A at time 0. This will perform an ANOVA of all 5 coefficients, giving you a single p-value for overall evidence of nonlinear response to the continuous variable. To test for differences in the nonlinear response to cont.var between two groups, you will have to form the 5 contrasts between corresponding spline coefficients for the two groups and perform an ANOVA on these 5 coefficients.

As to whether you have enough data to fit splines with 5 df to your data, I honestly have no idea. I work primarily in RNA-seq and other similarly expensive data, so I've never had a data set large enough to make spline fits practical.