Question

LIMMA: interpreting interaction terms in an unbalanced factorial design with repeated measures

0

Entering edit mode

robert.maughan • 0

@robertmaughan-8096

Last seen 7.4 years ago

Ireland

Dear all,

I have searched for quite a while for questions relating to the following but couldn't find anything.

I am trying to analyse gene expression data from quite a challenging study design. The study was a non-randomised study in which 25 subjects were treated with a combination of two drugs where they were either treated with both, treated with one or other or treated with neither. Samples were taken from patients before treatment and after treatment. To me this represents a 2x2 factorial design with a repeated measure. Unfortunately due to the non-randomised observational nature of the study, the design is unbalanced (i.e. there are unequal numbers of patients in each treatment subgroup). The subgroup assignment is as follows:

	Drug A +	Drug A -
Drug B +	6	2
Drug B -	9	7

The primary comparisons of interest are:

Differences in the gene expression changes between subjects treated and not treated with Drug A
Differences in the gene expression changes between subjects treated and not treated with Drug B

However I would also like to control for the interaction effect of the presence or absence of the other drug in a given comparison since I believe that there will inevitably be significant interaction effect for some of the genes. In other words in defining the effect of Drug A we would like to take into account whether the Drug A effect is changing whether the subject is receiving Drug B or not.

From my searching on the topic, the conventional method to approach such a design would be a Two-Way factorial ANOVA with repeated measures. However, for reasons I do not fully understand repeated measures ANOVA is not recommended for unbalanced study designs even if SSIII testing is used for F tests. Multiple online sources and consultation with a statistician recommends the use of mixed-effect models for the analysis of unbalanced designs using something like the nlme R package. The statistician also recommended the following model:

gene~TIME*drugA*drugB, random=~1|SUBJECT

Where TIME represents the timepoint (pre and post), drugA and drugB are binary vector indicating subject's treatment assignment and the random argument specifies that the linear model includes a random intercept for each subject. This returns coefficients estimating the "main effects" of drugA and drugB over time and whether there is an interaction between drugB on the effect of drugA over time. Thereby allowing me to test for the specific effect of each treatment and account for the interaction of the other drug.

Now to my question! Since I would rather use a well established method for gene expression analysis such as LIMMA for the analysis of this study and I would rather wish to avoid implementing a mixed model for every gene, would it be appropriate to use LIMMA for this? I have actually gone ahead and carried out this analysis in LIMMA and verified that it returns the same coefficients. I did this using the following:

design<-model.matrix(~TIME*drugA*drugB)
corfit <- duplicateCorrelation(hamaset,design,block=patient)
corfit$consensus
fit <- lmFit(hamaset,design,block=patient,correlation=corfit$consensus)

Note: I used the duplicateCorrelation function to block for inter-subject differences for a paired approach.

So more specifically I'm asking is it appropriate to use LIMMA to test for interaction effects in this unbalanced design using the above method? Will the unequal number of subjects in the treatment subgroups result in biased or violations of the assumptions implemented in these interaction tests?

Any advice or comments on the question or carrying out this type of analysis would be much appreciated.

Best,

Rob

LIMMA FACTORIAL REPEATED MEASURES differential gene expression • 5.0k views

ADD COMMENT • link 8.7 years ago • updated 8.5 years ago robert.maughan • 0

0

Entering edit mode

You say have 25 subjects, but the numbers allocated to each subgroup don't match up, i.e., 6+2+9+7 = 24. Also, is the number of samples twice the number of subjects, because you have before and after samples for each subject?

ADD REPLY • link 8.7 years ago Aaron Lun ★ 28k

0

Entering edit mode

Hi Aaron,

Thanks for your reply. Yes you're correct that was an oversight on my part, at the end of the study we only had usable samples from 24 subjects although 25 were recruited. For simplicity's sake let's take the total number of subjects as 24.

Again for the sake of the example I gave lets say that yes there are 48 samples: 24 baseline and 24 post-treatment. I actually have 2 post-treatment timepoints and there are some missing observations giving me a total sample number of 66. However, I intend to do the baseline vs timepoint 1 and baseline vs timepoint 2 analyses separately in order to account for early and late changes.

ADD REPLY • link 8.7 years ago robert.maughan • 0

score 4 · Answer 1 · 2015-08-27

The duplicateCorrelation approach isn't wrong, but a better strategy would be to just block directly on patient in the design matrix. Let's say we have:

patient <- factor(rep(1:24, each=2))
time <- factor(rep(c("before", "after"), 24))
treatment <- factor(c(rep("AB", 6*2), rep("A", 9*2),
    rep("B", 2*2), rep("None", 7*2)))

We then set up the design matrix as shown below. Note some fiddling is required to get it into full column rank.

design <- model.matrix(~0 + patient + time:treatment)
design <- design[,!grepl("before", colnames(design))]

Running colnames(design) gives us something like this:

 [1] "patient1"                "patient2"               
      ... a bunch of patients ...
[23] "patient23"               "patient24"              
[25] "timeafter:treatmentA"    "timeafter:treatmentAB"  
[27] "timeafter:treatmentB"    "timeafter:treatmentNone"

The first 24 coefficients represent gene expression before treatment in each patient. The last four coefficients represent the effect of each type of treatment - drug A by itself, A and B together, B by itself, or nothing at all. We'll rename them here for convenience:

colnames(design)[25:28] <- levels(treatment)

To test for the overall effect of drug A or B, you can set up these contrasts:

con.a <- makeContrasts(A - None, levels=design)
con.b <- makeContrasts(B - None, levels=design)

This will test for whether the effect of treatment with either drug is significantly different to that of no treatment. To test for an interaction (i.e., non-additive) effect, you can do something like the code shown below. This determines whether the sum of the individual treatment effects is equal to the effect of the combined treatment.

con.inter <- makeContrasts((A-None) + (B-None) - (AB-None), levels=design)

Blocking directly on patient in the design matrix provides some advantages over blocking through duplicateCorrelation. The latter makes some assumptions about how the patient effects behave (e.g., homogeneously distributed, no outliers or batch effects within patients) and shares information across genes to precisely estimate the correlation (which will introduce some amount of bias when the true gene-level correlations are different). These issues can be ignored when blocking in the design matrix - albeit at the cost of power, as the information used to estimate the patient effects could have been used to estimate the treatment effects.

score 0 · Answer 2 · 2015-08-27

0

Entering edit mode

robert.maughan • 0

@robertmaughan-8096

Last seen 7.4 years ago

Ireland

Hi Aaron,

Thanks for the helpful response.

I absolutely agree that blocking for patient by actually including it in the model is a better approach than the duplicateCorrelation method. I believe that I went with duplicateCorrelation due to difficulties in getting a fullrank matrix.

Also you're suggestion of splitting groups into A, B, AB and None is something I hadn't actually thought of and that I'll definitely try out. I like that this approach is conceptually simpler than the usual 2x2 interaction model and probably doesn't face any issues with the unbalanced design. Also it avoids the confusing and sometimes unimportant coefficients produced by the x*y*z interaction models.

My concern with this method however, is the loss of the subjects receiving AB in the testing of the effect of drug A or B. In the ~TIME*drugA*drugB model, it's my understanding that all subjects who are on A including those receiving A and B will be used in testing for the effect of A and this test will therefore have more power. This is particularly relevant in the Drug B-only subgroup which only contains 2 subjects.

So my original logic was to test for the presence of a drugA*drugB interaction in all patients on drugs A and B and maybe subsequently follow up with simpler models for those in which there was no evidence for an interaction. Would this not be a recommended approach in this context? - also remember the unbalanced design and possible difficulties in interactions with these..

Thanks again,

Rob

ADD COMMENT • link 8.7 years ago robert.maughan • 0

3

Entering edit mode

You're right in that the patients treated with the AB combination will not be used in estimating the treatment effect of drug A or B in my design. Which is fine, because those patients are needed for estimating the interaction effect. You wouldn't be able to do that if you assumed that the effect of the combination was additive. In any case, the AB patients probably won't be used in estimation of the drug A/B treatment effect in the ~TIME*drugA*drugB model either - these patients will have an interaction term that absorbs any DE if you were to drop the drugA or drugB coefficients by themselves.

If you do want to get information from those patients, then you have two possibilities. The first is to set up a contrast that includes the AB patients. For example, the contrast below looks at the difference in treatment effects between AB and A, and asks whether the average of this with the treatment effect of B is equal to zero:

con <- makeContrasts(((AB-A) + (B-None))/2, levels=design)

This only works for those genes that don't exhibit an interaction effect, as it assumes that the treatment effect of B is additive upon that of A in the combination group. Thus, subtracting the A effect from the AB effect will give an estimate of the B effect, which can be tested against zero alongside B-None. Alternatively, you could fit an additive model:

is.None <- factor(grepl("None", treatment))
is.A <- factor(grepl("A", treatment))
is.B <- factor(grepl("B", treatment))
design.add <- model.matrix(~0 + patient + time:is.None + time:is.A + time:is.B)
design.add <- design.add[,!grepl("before|FALSE", colnames(design.add))]

You can then test the coefficients for A or B against None, as I described in my first answer. This will use the AB patients to estimate the effect of each drug. Again, you should only do this for those genes that don't have any evidence of an interaction. In general, I prefer using the full model if possible, as that provides more appropriate variance estimates for EB shrinkage; it also simplifies the set of DE comparisons later on, as I'm only using one design matrix throughout the analysis.

ADD REPLY • link 8.7 years ago Aaron Lun ★ 28k

score 0 · Answer 3 · 2015-08-28

0

Entering edit mode

robert.maughan • 0

@robertmaughan-8096

Last seen 7.4 years ago

Ireland

Many thanks Aaron,

I'll go through what you've suggested and give it a try and will report back.

Thanks,

Rob

ADD COMMENT • link 8.7 years ago robert.maughan • 0

score 0 · Answer 4 · 2015-09-28

Hi Aaron,

I've opted for your suggested approach and I think it's serving me well, many thanks for that.

I have another problem however. Since this is a non-randomised study there are continuous covariates which I would like to adjust/control for. From reading the support questions here, I've seen that this can be achieved by simply adding them to the model.

I've tried the following and it isn't working unfortunately:

design2<-model.matrix(~0+patient+Covariate+time:treatment)
design2<-design2[,!grepl("wk0", colnames(design2))]
fit<-lmFit(eset,design2)

Giving me the warning:

Coefficients not estimable: Covariate 
Warning message:
Partial NA coefficients for 55 probe(s)

Looking at the design matrix I reckon that lmFit doesn't know what to do with the value of Covariate for every 0 or 1 possibility of patient. Is there any way to include a patient blocking directly in the model as well as a covariate, or am I forced to use duplicateCorrelation for either patient or covariate?

Many thanks,

Robert

score 0 · Answer 5 · 2015-10-30

Apologies Aaron, I didn't get an email alert to this so didn't see your response.

There are actually two covariates, baseline viral load and total cholesterol. Since there's a few analyses going on lets just take the example of 3 treatment groups with baseline and week 2 measurements. So the design matrix this produces looks something like this:

Patient 1	…	Patient 3	Viral load	TotChol	Treat1:wk2	Treat2:wk2	Treat3:wk2
1	…	0	4.611723	5.1	1	0	0
0	…	0	5.752048	4.9	1	0	0
0	…	1	3.763428	4.9	0	1	0
0	…	0	5.328380	5.4	0	1	0
0	…	0	5.419956	5.1	0	0	1
0	…	0	4.970347	4.7	0	0	1
…	…	…	…	…	…	…	…

In the end I found that if I removed the patient blocking and switched to the duplicateCorrelation function that it worked. I found this suggested by Gordon here https://stat.ethz.ch/pipermail/bioconductor/2013-October/055651.html

I figured that the baseline viral load and totalcholesterol were confounded with the patient variable as it's the same baseline measure repeated for each timepoint of the patient.

Thanks,

Robert