Question: limma blocking and including covariates
0
2.6 years ago by
gstone20
gstone20 wrote:

I have received extraordinary help on this website, and while I have learned a lot about limma, I am still a beginner and am trying to overcome certain limitations. I am trying to determine the effect of a condition within and between the sexes, and one limitation is that I do not feel comfortable implementing my paired study as a factorial design. I do feel comfortable using duplicateCorrelation() and treating my blocking factor as a random effect, and I have 118 patients/236 samples, so I believe my large sample number allows for this without too much loss.

***I am wondering, though, if the duplicateCorrelation() function blocks both between and within samples. If I use duplicateCorrelation() blocking for my patient variable, should I still adjust for my other covariates in my design matrix, or does blocking on that variable implicitly adjust for other factors between patients?***

Any guidance would be much appreciated!

tl;dr does duplicateCorrelation() adjust for differences within and between samples? should I still adjust for other covariates in my design matrix?

modified 2.6 years ago by Ryan C. Thompson7.4k • written 2.6 years ago by gstone20
Answer: limma blocking and including covariates
1
2.6 years ago by
The Scripps Research Institute, La Jolla, CA
Ryan C. Thompson7.4k wrote:

Whatever you choose to do, you should only adjust for inter-patient differences once in your design. If all your covariates only vary by patient, then I believe including them in the design is redundant with blocking on patient using duplicateCorrelation, since blocking on patient already adjusts for differences between patients.

On the other hand, if you have covariates that vary by sample within each patient, such as blood test results from the same day the sample was taken, then you should still include these.

2

Ryan is mostly right. However, duplicateCorrelation assumes that the patient effects are homogeneously distributed. This will not be true if there are systematic factors of variation across patients. To illustrate, let's say with have 5 male and 5 female patients, with a pair of treated/untreated samples for each patient - this experimental design is mocked up below:

treatment <- factor(rep(c("Y", "N"), 10))
patient <- factor(rep(1:10, each=2))
sex <- factor(rep(c("M", "F"), each=10))

For demonstration purposes, let's mock up some data as well. First, the expression values:

ngenes <- 1000
y <- matrix(rnorm(ngenes*length(treatment)), nrow=ngenes)

Now adding a random patient-specific effect for each gene:

patient.effect <- matrix(rnorm(ngenes*nlevels(patient)), nrow=ngenes)
y <- y + patient.effect[,patient]

Finally, adding a sex effect. I'm using a very large value here, just to illustrate my point:

y[,sex=="M"] <- y[,sex=="M"] + 10

Now let's run through the two duplicateCorrelation-based options. The first is to not block on sex:

library(limma)
design <- model.matrix(~treatment)
dc <- duplicateCorrelation(y, design, block=patient)
dc$consensus # 0.9692942 ... while the second is to block on sex: design2 <- model.matrix(~sex+treatment) dc2 <- duplicateCorrelation(y, design2, block=patient) dc2$consensus # 0.506429

So clearly, there is a difference in the two strategies. Incidentally, the second approach is correct, as the true correlation here is 0.5. The reason for this difference is because of the sex effect, which causes all samples from patients of the same sex to be more correlated to each other. This inflates the correlation estimate when you don't consider the sex of the patients in the model.

I suspect that this doesn't have much effect in this particular setting, where you're comparing within patients - as far as I understand, things mostly cancel out. Rather, the inflated correlation would be much more damaging in cases where you compare across patients. If you consider an experimental design where each patient is either treated or untreated and has two samples:

treatment <- factor(rep(c("Y", "Y", "N", "N"),  5))
patient <- factor(rep(1:10, each=2))
sex <- factor(rep(c("M", "F"), each=10))

... you'll find that the first strategy is horribly conservative, i.e., type I error rates well below the specified threshold.

Of course, if you directly blocked on patient in the design matrix, additional factors would not be a concern. This seems like a good time to start getting comfortable with more complex models - it probably won't be the last time you'll see them.

Thank you for your elaboration--it's made me think much more. I am looking at the change in gene expression due to condition within each sex and between the sexes (f.post-f.pre)-(m.post-m.pre), so I am comparing across patients. I get the sense from your explanation that it doesn't hurt to add the covariates in the design matrix--am I misunderstanding?

I would like to block on patient in the design matrix, especially given your generous explanation yesterday on its advantages. However, when I use duplicateCorrelation() and randomize the patient effect I get 117 sig genes (224 if I include other covariates in the design matrix). However, when I block on patient in the design matrix, I only get 6 sig genes. Especially since there is clinical evidence to suggest that there should be genetic differences, I'm thinking I must be doing something wrong in implementing your suggestion. Additionally, if I included patient in the design matrix, I'm unclear as to how I would get all 3 comparisons I'm after.

2

It's a bad idea to shop around for the analysis strategy that gives you the most DE genes. As I mentioned before, duplicateCorrelation is somewhat liberal and makes a number of assumptions. How do you know that the greater number of DE genes is real, and not just some consequence of loss of type I error control or violations of the assumptions? Indeed, I often cringe when people try to justify their analysis choices by saying that "the results make biological sense". Well, of course they would; the parameters were chosen to get the results that they wanted, so it's hardly a surprise that they get what they expected. If that logic were valid, we wouldn't need to do experiments at all.

If you want more DE genes, the more honest approach is to increase the FDR threshold. The 5% cutoff is not written in stone - go up to 10% or 20% if you need to. You'll have to be aware that there will be more false discoveries in that set, so you'll need to spend more effort in validation. However, this is preferable to performing an analysis that gives you more DE genes but understates the FDR - in that case, you'll be thinking that your results are a lot more reliable than they actually are.

Finally, all three comparisons are easy to perform when you block on patient in the design matrix. if you want to compare pre/post within each sex, just drop the individual sex:condition interaction terms. To compare the pre/post effects between sexes, just contrast the two interaction terms against each other using makeContrasts.

Thank you for all the help!