LIMMA for paired analysis with adjustment for confounding continuous variables
2
0
Entering edit mode
Loodramon • 0
@loodramon-9126
Last seen 6.0 years ago
Austria

Hi,

I'm analysing a longitudinal gene expression qPCR array study in LIMMA. I have 2 groups with baseline and week 24 measurements. I also have 2 confounding baseline continuous variables that I would like to include in the model.

I found something similar discussed here: https://stat.ethz.ch/pipermail/bioconductor/2013-October/055651.html

But I'm not sure if their confounding variable CES is of the same nature as mine, i.e. only a baseline measure. Also this type of analysis is different to many of the less complex studies I've done previously - so I thought I'd check.

Nevertheless I've conducted the analysis as follows and would like to make sure that I'm on the right path.

design<-model.matrix(~0+time:treatment+confound1+confound2)

colnames(design)[3:6]<-c("t1wk0","t1wk24","t2wk0", "t2wk24") #produces a design matrix that looks like:   confound1  confound2 t1wk0 t1wk24 t1wk0 t2wk24 1  4.611723     5.1     1      0     0      0 2  5.752048     4.9     1      0     0      0 3  3.763428     4.9     1      0     0      0 4  5.328380     5.4     1      0     0      0 5  5.419956     5.1     1      0     0      0 6  4.970347     4.7     0      0     1      0 7  3.857935     4.7     1      0     0      0 8 4.927370     3.3     1      0     0      0 9 5.875061     4.1     1      0     0      0 10 4.580925     4.9     0      0     1      0 11 4.442480     4.3     0      0     1      0 12 4.338456     6.3     0      0     1      0 13 4.611723     5.1     0      1     0      0 14 5.752048     4.9     0      1     0      0 15 3.763428     4.9     0      1     0      0 16 5.328380     5.4     0      1     0      0 17 5.419956     5.1     0      1     0      0 18 4.970347     4.7     0      0     0      1 19 3.857935     4.7     0      1     0      0 20 4.927370     3.3     0      1     0      0 21 5.875061     4.1     0      1     0      0 22 4.580925     4.9     0      0     0      1 23 4.442480     4.3     0      0     0      1 24 4.338456     6.3     0      0     0      1

corfit <- duplicateCorrelation(eset,design,block=patient) corfit$consensus #where patient is a categorical variable denoting which patient is which# fit_adj<-lmFit(eset,design,block=patient,correlation=corfit$consensus) fit_adj<-eBayes(fit_adj)

#Then pull out the comparisons of interests with specific contrasts: con.t1<-makeContrasts(t1wk24-t1wk0, levels=design) con.t2<-makeContrasts(t2wk24-t2wk0, levels=design) con.t1vt2<-makeContrasts(((t1wk24-t1wk0)-(t2wk24-t2wk0)), levels=design) #etc

I think this is method is doing okay since the results are making sense and are similar to previous analyses. But I'm concerned that the value of the baseline confounding variables will be repeating in the model for both wk0 and wk24 arrays for a patient, will this interfere with the analysis?

Thanks

limma gene expression confounding adjustment paired • 2.5k views
0
Entering edit mode

What is the patient blocking factor with respect to the other terms? It's hard to know whether it'll interfere or not if it's unspecified.

0
Entering edit mode

It's a vector containing the Patient IDs as factors. There's 12 patients with 2 arrays each. You'll notice that the confound1 values repeat at array 13, this is the start of the wk24 arrays.

1
Entering edit mode
Aaron Lun ★ 27k
@alun
Last seen 11 hours ago
The city by the bay

It seems that you're interested in the effect of time in each treatment (con.t1, con.t2) as well as whether there's any differential time effect between treatments (con.t1vt2). If that's the case, then we can avoid any problems with confounding variables altogether, by just doing:

patient <- factor(rep(1:12, 2))
time <- factor(rep(c("wk0", "wk24"), each=12))
treatment <- factor(rep(rep(c("t1", "t2", "t1", "t2"), c(5,1,3,3)), 2))
design <- model.matrix(~0 + patient + time:treatment)
design <- design[,-grep("timewk0", colnames(design))] # full rank

Looking at colnames(design) gives us:

 [1] "patient1"             "patient2"             "patient3"
[4] "patient4"             "patient5"             "patient6"
[7] "patient7"             "patient8"             "patient9"
[10] "patient10"            "patient11"            "patient12"
[13] "timewk24:treatmentt1" "timewk24:treatmentt2"

The first 12 coefficients represents the "wk0" expression for each patient, while the last two represent the effect of time for each treatment (i.e., effect of time within patients, averaged across all patients with the same treatment). Dropping 13 and 14 will perform an equivalent DE comparison to con.t1 and con.t2, respectively, while equating them in a contrast will give the same test as con.t1vt2.

This design avoids having to deal with confounding variables as they are fully absorbed into the patient blocking factor - this is completely fine, as both samples for each patient have the same values for the confounders anyway. I suspect it's also a better model as it avoids making assumptions about the linearity/additivity of gene expression with respect to changes in the confounding variables. Nonetheless, if you want to use your original model, I don't see any inherent problems with having the same confounders for both patient samples.

0
Entering edit mode
Loodramon • 0
@loodramon-9126
Last seen 6.0 years ago
Austria
Thanks for your answer Aaron. Yes what you've suggested is pretty much my 'unadjusted' model. However, because there are differences at baseline between the two groups in the confounding variables I wanted to include them in the model to account for DE genes in the direct between group comparison that may be caused by differences in these. Yes I believe that I am assuming an additive effect of the confounding, I may try an interaction term to account for this, but I find these models difficult to interpret. Ideally I suppose I should include the patient directly in my model rather than using duplicateCorrelation. I tried this model and it was confounded, so I guess I'll use the above. Thanks again
0
Entering edit mode

To clarify, differences in baseline (i.e., "wk0") expression between individuals or treatments shouldn't cause DE genes in your contrasts, as long as you include patient in the design matrix. Any individual- or treatment-specific contributions to baseline expression will be absorbed by the patient blocking factors and negated. DE can only be observed if there are differences between time points within the same patient - the absolute value of the baseline expression doesn't matter.

0
Entering edit mode

The goal of including the values of the baseline clinical variables that show between group differences is more to control for effects similar to regression to the mean. I.e. If the expression of a gene increased in one of the groups but not the others and that group which had the increase had the highest levels of one of the confounding clinical variables - we would like to adjust this gene expression response for the baseline level of the confounding variable in order to enable better between group comparisons.

I've ran the analysis using the model I set out above (using duplicateCorrelation) with the confounding variables in and with them out and the LogFCs are exactly the same but the pvalues do seem to change but this only amounts to  a difference of 1/2 DE genes. Both of these results are then quite similar to the analysis which includes patient directly in the model. From a conceptual point of view I think I'd rather have the confounding variables in the model. But I'd love to hear from others with experience in this. Thanks

0
Entering edit mode

The need to include confounders in the design would depend on which groups being compared. If you're comparing between, say, t1wk0 and t2wk0, or between t1wk24 and t2wk24, then you may need to account for the confounding variables, because those confounders are different between the groups being compared. However, if you're comparing between t1wk0 and t1wk24, or between t2wk0 and t2wk24 (as in your original post), then the values of the confounders will be the same between groups as the same patients are involved in both groups. As such, they should not contribute to any spurious DE, because the average effect of the confounders across all patients in each group should be the same.

Personally, I think that the direct inclusion of patient in the design matrix is more elegant - it avoids linearity or additivity assumptions and it accounts for additional patient-specific effects on expression. Such effects would be irrelevant when you're looking for within-patient changes across the two time points, but would inflate the variance estimates in the duplicateCorrelation approach, which treats different patients as direct replicates (even when the confounding variables are considered). Well, I guess it's a largely academic argument, if you're getting the same results from both strategies - that said, directly blocking on patient seems safer to me for the comparisons that you're performing.

0
Entering edit mode

Yes I agree that the within patient group can't be confounded as obviously its the same patients but the confounding concern is for the between group response comparison given by the contrast:

con.t1vt2<-makeContrasts(((t1wk24-t1wk0)-(t2wk24-t2wk0)), levels=design)

Also agree regarding including the patient directly in the model, couldn't get it to work here.

0
Entering edit mode

Actually, that contrast will work fine if you block directly on patient in the model. Remember, you're not comparing between groups directly - instead, you're computing the time effect within each treatment, and then testing for differences in the time effect between treatments. More specifically, we know that we can estimate t1wk24-t1wk0 (i.e., the time effect in treatment 1) and t2wk24-t2wk0 (time effect in treatment 2) based on within-patient comparisons; it's a simple matter to compare the two estimates. For the model I've suggested in my original answer, you can just run contrasts.fit with c(rep(0, 12), 1, -1) to perform this test.