Question: limma Multi-level Experiments correcting for continuous covariates
0
6 months ago by
c.bettencourt0 wrote:

I have DNA methylation data from Illumina EPIC arrays for 10 patients and 6 controls, with 3 different tissues per individual (run in 3 different days). I have normalised data (BMIQ, champ.norm()) and removed batch affect (champ.runCombat()) following ChAMP pipeline. Two tissues are affected in the disease I am studying, and the 3rd is almost unaffected. The design is quite complex, so I need to perform multilevel analysis, both within and between subjects. To complicate things there is a difference in age and PMD between patients and controls, so I would like to correct for these covariates (continuous).

The questions I want to answer are:

1. Which are the differentially methylated probes (DMPs) between patients and controls for each tissue?
2. Within the patients group, which are the DMPs between the affected tissues versus the less affected or unaffected tissues?
3. Which are the DMPs between the affected tissues versus the unaffected tissues that are specific to the disease and not the tissue?

I have read the limma user guide which I found was very helpful. As my study design is complex and I am very new to DNA methylation analysis, I would like to be sure my analyses are correct.

If I just wanted to answer question 1 (between subjects analysis, correcting for age and PMD), could I use the following approach?

Approach 1

f <- factor(myLoad$pd$Group_Region)

PMD <- as.numeric(myLoad$pd$PMD)

Age <- as.numeric(myLoad$pd$Age)

design <- model.matrix(~0+f+Age+PMD)

fit <- lmFit(mval,design)

cm <- makeContrasts(

MSAvsCTRL_OCTX = fMSA_OCTX-fCTRL_OCTX,

MSAvsCTRL_FCTX = fMSA_FCTX-fCTRL_FCTX,

MSAvsCTRL_CRBL = fMSA_CRBL-fCTRL_CRBL,

levels=design)

fit2 <- contrasts.fit(fit, cm)

fit2 <- eBayes(fit2)

To answer the other questions I need to use multi-level analysis. I am unsure though if I can include the covariates or not. I have tried design 1 and design 2 (approaches 2 and 3) below, and design 2 (approach 3) would give me DMPs numbers similar to my approach 1 for the contrasts between patients and controls only. Could you let me know if my approach 3 is correct please?

Approach 2

regions <- factor(paste(myLoad$pd$Group,myLoad$pd$Region,sep="."))

design1 <- model.matrix(~0+regions)

colnames(design1) <- levels(regions)

corfit <- duplicateCorrelation(mval,design1,block=myLoad$pd$ID)

corfit$consensus fit <- lmFit(mval,design1,block=myLoad$pd$ID,correlation=corfit$consensus)

cm <- makeContrasts(

MSAvsCTRL_OCTX = MSA.OCTX-CTRL.OCTX,

MSAvsCTRL_FCTX = MSA.FCTX-CTRL.FCTX,

MSAvsCTRL_CRBL = MSA.CRBL-CTRL.CRBL,

MSA_FCTXvsOCTX = MSA.FCTX-MSA.OCTX,

MSA_CRBLvsOCTX = MSA.CRBL-MSA.OCTX,

MSA_CRBLvsFCTX = MSA.CRBL-MSA.FCTX,

MSA_FvsO_CRTL_FvsO = (MSA.FCTX-MSA.OCTX)-(CTRL.FCTX-CTRL.OCTX),

MSA_CvsO_CRTL_CvsO = (MSA.CRBL-MSA.OCTX)-(CTRL.CRBL-CTRL.OCTX),

MSA_CvsF_CRTL_CvsF = (MSA.CRBL-MSA.FCTX)-(CTRL.CRBL-CTRL.FCTX),

levels=design1)

fit2 <- contrasts.fit(fit, cm)

fit2 <- eBayes(fit2)

Approach 3

regions <- factor(paste(myLoad$pd$Group,myLoad$pd$Region,sep="."))

design2 <- model.matrix(~0+regions+Age+PMD, myLoad$pd) colnames(design2) <- c("CTRL.CRBL", "CTRL.FCTX", "CTRL.OCTX", "MSA.CRBL", "MSA.FCTX", "MSA.OCTX", "Age", "PMD") corfit <- duplicateCorrelation(mval,design2,block=myLoad$pd$ID) corfit$consensus

fit <- lmFit(mval,design2,block=myLoad$pd$ID,correlation=corfit$consensus) cm <- makeContrasts( MSAvsCTRL_OCTX = MSA.OCTX-CTRL.OCTX, MSAvsCTRL_FCTX = MSA.FCTX-CTRL.FCTX, MSAvsCTRL_CRBL = MSA.CRBL-CTRL.CRBL, MSA_FCTXvsOCTX = MSA.FCTX-MSA.OCTX, MSA_CRBLvsOCTX = MSA.CRBL-MSA.OCTX, MSA_CRBLvsFCTX = MSA.CRBL-MSA.FCTX, MSA_FvsO_CRTL_FvsO = (MSA.FCTX-MSA.OCTX)-(CTRL.FCTX-CTRL.OCTX), MSA_CvsO_CRTL_CvsO = (MSA.CRBL-MSA.OCTX)-(CTRL.CRBL-CTRL.OCTX), MSA_CvsF_CRTL_CvsF = (MSA.CRBL-MSA.FCTX)-(CTRL.CRBL-CTRL.FCTX), levels=design2) fit2 <- contrasts.fit(fit, cm) fit2 <- eBayes(fit2) Also, do you think my last 3 contrasts are answering my question 3? I would be very grateful if someone could give me some reassurance or some suggestions to correct and/or improve my analyses. Many thanks!!!! ADD COMMENTlink modified 6 months ago by Aaron Lun23k • written 6 months ago by c.bettencourt0 You need to give more information about your experimental design. What's "affected", "less affected", "unaffected"? What is "MSA"? What is "FCTX", "CRBL", "OCTX"? You mention a disease - what is it? You should also provide the smallest subset of your sample table that captures the complexity of your experiment; this will make it much easier to communicate. ADD REPLYlink written 6 months ago by Aaron Lun23k Many thanks Aaron for getting back to me. MSA (multiple system atrophy) is the disease group (CTRL are the controls), and there are 3 brain tissues per individual (CRBL, FCTX, and OCTX), not equally affected in this disease. OCTX is unaffected, FCTX is mildly affected and CRBL is largely affected in MSA, so in my contrasts within MSA I always compare a more severely affected with a less affected tissue. Please see below a subset of my sample sheet.  Sample_Name ID Group Region Group_Region Age PMD Sex Sample_Run CTRL1_Cbs P31_05 CTRL CRBL CTRL_CRBL 91 19.7 F 1 CTRL1_Obs P31_05 CTRL OCTX CTRL_OCTX 91 19.7 F 2 CTRL1_Fbs P31_05 CTRL FCTX CTRL_FCTX 91 19.7 F 2 CTRL2_Cbs P61_09 CTRL CRBL CTRL_CRBL 99 32.1 F 1 CTRL2_Obs P61_09 CTRL OCTX CTRL_OCTX 99 32.1 F 2 CTRL2_Fbs P61_09 CTRL FCTX CTRL_FCTX 99 32.1 F 2 MSA1_Cbs P70_06 MSA CRBL MSA_CRBL 50 30.3 M 1 MSA1_Obs P70_06 MSA OCTX MSA_OCTX 50 30.3 M 2 MSA1_Fbs P70_06 MSA FCTX MSA_FCTX 50 30.3 M 2 MSA2_Cbs P87_08 MSA CRBL MSA_CRBL 72 88.8 M 1 MSA2_Obs P87_08 MSA OCTX MSA_OCTX 72 88.8 M 2 MSA2_Fbs P87_08 MSA FCTX MSA_FCTX 72 88.8 M 2 CTRL3_Cbs P94_05 CTRL CRBL CTRL_CRBL 71 38.8 M 2 I hope this answers your questions, if not please let me know. Many thanks!!! ADD REPLYlink written 6 months ago by c.bettencourt0 Answer: limma Multi-level Experiments correcting for continuous covariates 5 6 months ago by Aaron Lun23k Cambridge, United Kingdom Aaron Lun23k wrote: I don't really know what you mean by multi-level analysis, this entire set-up seems pretty straightforward to me. Use the design matrix from your approach 1 (you may want to consider including Sex as well): f <- factor(data$Group_Region)
PMD <- as.numeric(data$PMD) Age <- as.numeric(data$Age)
design <- model.matrix(~0+f+Age+PMD)

... and block on ID in duplicateCorrelation to account for the patient-specific effect. I'll assume that you don't have a confounded experimental design (e.g., patients are not all male and controls are not all female).

Question 1: differences between disease and control.

As you've previously described in your approach 1, though I'm a bit bemused about what "fCTRL_CRBL" means; how can you have a tissue that's "largely affected" by the disease in control patients? (Same for all of the other definitions.) I'll assume that these regions are principally defined by measures other than disease effect.

Question 2: differences between regions within disease.

Just do the same contrast as you've set up in your approach 2. No need to change the design matrix.

cm <- makeContrasts(
MSA_OCTX_vs_FCTX = fMSA_OCTX - fMSA_FCTX,
MSA_OCTX_vs_CRBL = fMSA_OCTX - fMSA_CRBL,
MSA_FCTX_vs_CRBL = fMSA_FCTX - fMSA_CRBL,
levels=design)

Question 3: disease-specific differences between regions.

Just do the same contrast as you've set up in your approach 3. Again, no need to change the design matrix.

cm <- makeContrasts(
(fMSA_OCTX - fMSA_CRBL) - (fCTRL_OCTX - fCTRL_CRBL), # etc., you can figure it out.
levels=design)

This will identify genes where the log-fold change between regions differs between control and disease. You should be able to intersect this with the set of DE genes that are significant in MSA_OCTX_vs_FCTX to find the genes that you want. Note that this will not necessarily filter out genes that are DE between regions in the control individuals, provided that the inter-region log-fold change in the controls is significantly different from the inter-region log-fold change in the diseased patients. I'd argue that this is the desired behaviour, as such genes exhibit evidence of dysregulation in the disease state and are interesting.

Dear Aaron, many thanks for your great explanation.

As I am new to this type of analysis, I would just like to double check and/or clarify a few things:

1) Would you run all contrasts at once or separately by question 1, 2 and 3?

2) Regarding your question about what does "fCTRL_CRBL" means, I am analysing tissue from 3 brain regions (cerebellum, frontal cortex and occipital cortex) of both MSA patients and control individuals, CTRL_CRBL is tissue from the cerebellum  of control individuals.

3) I have both males and females in MSA patients and control individuals and don't see a sex effect in SVD, that's why I haven't included sex in the model. Would you include it?

4) Regarding your explanation for question 3

cm <- makeContrasts(
(fMSA_OCTX - fMSA_CRBL) - (fCTRL_OCTX - fCTRL_CRBL), # etc., you can figure it out.
levels=design)

Just to make sure I understood, these contrasts will only give me the significant fold change differences between the tissue comparisons in patients and controls. In that case I would better answering question 3 by simply filtering out genes that are differentially methylated in CTRL_OCTX vs CTRL_CRBL from those that are differentially methylated in MSA_OCTX vs MSA_CRBL. Don't you think?

Thanks again!

1. Doesn't matter in terms of contrasts.fit, you can always ask for specific results in topTable by setting coef to the desired column index of the contrasts matrix. Of course, if you supply a vector of indices, then you're doing an ANOVA with a joint null hypothesis - probably not what you want.
2. Okay.
3. It's up to you, but usually sex has a big effect, on XIST if nothing else.
4. Your proposed approach would indeed be the first thing that comes to mind. However, the problem is that you are effectively intersecting the DE set in the MSA OCTX vs CRBL comparison with the non-DE set in the corresponding control comparison. Our entire hypothesis testing framework is built around controlling the type I error rate, i.e., false positives among the genes that we find significant. We cannot say much about the genes that we do not find to be significant, because we aren't controlling the type II error rate. Are non-significant genes truly non-DE with a log-fold change of zero, or are they DE but we just don't have enough power to detect them? There's no easy way to tell. (It's not impossible, but gets ugly quickly.) Remember, for your question, false negatives in the control comparison would end up being false positives because you're including genes that shouldn't be in the final intersection. That's why I've tried to frame the answer to your question in terms of significant findings, for which we can control the error rate with limma.

On the (MSA_OCTX - MSA_CRBL) - (CTRL_OCTX - CTRL_CRBL) type of contrast all adj. p-values are very high and none falls below 0.05. I will have to think a bit more about point 4.

Many thanks Aaron for all the great advise and explanations!

Keep in mind that zero DE != wrong. Certainly I've watched collaborators - against all advice - waste time and money trying to validate DE genes that were never there to begin with.

I'll also add that, in light of this latest result, I would be very suspicious of any genes that you might find by intersecting the non-DE set from the control comparison with the DE set from the disease comparison. What you really don't want is a gene that is non-DE in control because its effect size is just below the threshold for significance; this will be included in the intersection even if the effect size in disease is not much different (or even smaller, depending on the experimental design)! TL;DR If these genes were really DE in disease only, they should have a significant interaction effect, and the fact that there are no such genes is not something to be swept under the rug.

I agree with you that zero DMPs is not necessarily wrong!

In our case, I think zero significant differences in comparisons like (MSA_O - MSA_C) - (CTRL_O - CRTL_C) may tell us that the effect size of the overlapping (patients and controls) significant differences between tissues (Occipital vs Cerebellum) does not differ significantly between the two subject groups, which per se is a finding! There may still be disease only differences, but the problem is: how to distinguish true non-DMPs from "lack of power" non-DMPs in controls?

Many thanks!

Your question is a classic one that is very difficult to answer. If you really want to know: the simplest approach is to use two one-sided tests (TOST). It requires a user-specified bound T for the absolute log-fold change, below which you consider expression to be equivalent between groups A and B. You then test for whether A + T is significantly greater than B, and whether B + T is significantly greater than A. If both of these comparisons are significant, then you can say that the absolute log-fold change is less than T. You can see how we have reframed the non-DE question in terms of rejecting null hypotheses, allowing us to take advantage of type I error control.

In practice, TOST is a very conservative procedure, only yielding significant hits when T is so large as to be meaningless. I would also say that it doesn't really help you validate your claim, either, because there's too much of a grey zone between the two sets of results. That is, you have DE genes from your standard DE analysis; you have significantly equivalent genes from TOST; and then you have the rest of genes for which you cannot say anything, which probably makes up the majority of the genes. One can hardly say that two tissues are transcriptionally identical when you can't make any conclusions for most of the genes in the transcriptome!

The standard solution would be to increase power with more samples to overcome the conservativeness of TOST. However, I suspect that this would be quite a costly endeavour. Probably a better approach would be to accept that there are probably some changes between tissues that you don't have the power to detect, and to water down your claims appropriately.