multiple comparison using limma
3
1
Entering edit mode
@annabellecongras-12119
Last seen 5.5 years ago

Hi,

I am using limma to get differentially methylated probes from MethylEPIC data, using M values. I have 9 groups of samples (from groupA to group I) within 3 arrays, containing 2 to 5 samples each.

I get very different results depending on the number of comparison I perform ie the number of contrasts in my contrast matrix.

For example I have a number of significant differential probes much lower  in the groupD-groupC comparison when using :

Group2 <- factor(pheno$Sample_Group2) design2 <- model.matrix(~0+Group2, data=targets) colnames(design2) <- levels(Group2) fit2a <- lmFit(M_flt, design2) x2<-c("groupA-groupB", "groupA-groupC", "groupD-groupC", "groupE-groupC") contMatrix2 <- makeContrasts(contrasts=x2, levels=design2) fit2b <- contrasts.fit(fit2a, contMatrix2) fit2b <- eBayes(fit2b) summ_group2 <- summary(decideTests(fit2b, adjust.method="BH", p.value=0.05)) than when I add several other comparisons: Group2 <- factor(pheno$Sample_Group2)
design2 <- model.matrix(~0+Group2, data=targets)
colnames(design2) <- levels(Group2)
fit2a <- lmFit(M_flt, design2)
x2<-c("groupA-groupB", "groupA-groupC", "groupD-groupC", "groupE-groupC", "groupD-groupE", "groupF-groupG", "groupH-groupI")
contMatrix2 <- makeContrasts(contrasts=x2, levels=design2)
fit2b <- contrasts.fit(fit2a, contMatrix2)
fit2b <- eBayes(fit2b)

summ_group2 <- summary(decideTests(fit2b, adjust.method="BH", p.value=0.05))

I'm moving from 100000 differentially methylated probes in case 1 to 200000 in case 2! With different logFC values as well. Moreover some probes in the 100000 are not in the 200000 and the top ten most significant are not the same..

Is it correct to use all my data in the design matrix and the linear model if I perform comparisons on a subset of group only as in case 1?  Would it be more powerful to remove the unused data before performing the linear model?

If I keep all data in my design matrix and linear model, which method is the most accurate? I don't know if i'm underestimating significance in case 1 or overestimating it in case 2..

Hope you can help, sorry if the question has already been asked by newbies like me!

1
Entering edit mode
Aaron Lun ★ 27k
@alun
Last seen 9 hours ago
The city by the bay

You must have done something wrong here. Let me mock up some pretend data:

set.seed(100)
Group2 <- factor(paste0("group", rep(LETTERS[1:9], each=3)))
group.means <- matrix(rnorm(nlevels(Group2)*100), ncol=nlevels(Group2))
M_flt <- matrix(rnorm(length(Group2)*100, mean=group.means[,Group2]),
ncol=length(Group2))

... and run your first block of code (slightly paraphrased);

design2 <- model.matrix(~0+Group2)
colnames(design2) <- levels(Group2)
fit2a <- lmFit(M_flt, design2)
x2 <- c("groupA-groupB", "groupA-groupC", "groupD-groupC", "groupE-groupC")
contMatrix2 <- makeContrasts(contrasts=x2, levels=design2)
fit2b <- contrasts.fit(fit2a, contMatrix2)
fit2b <- eBayes(fit2b)


Looking at summ_group2 gives me:

   groupA-groupB groupA-groupC groupD-groupC groupE-groupC
-1            11            13            11            11
0             78            77            76            80
1             11            10            13             9

... while if we re-run it with the additional contrasts:

x2.alt <- c("groupA-groupB", "groupA-groupC", "groupD-groupC", "groupE-groupC",
"groupD-groupE", "groupF-groupG", "groupH-groupI")
contMatrix2.alt <- makeContrasts(contrasts=x2.alt, levels=design2)
fit2b.alt <- contrasts.fit(fit2a, contMatrix2.alt)
fit2b.alt <- eBayes(fit2b.alt)
summ_group2.alt <- summary(decideTests(fit2b.alt, adjust.method="BH", p.value=0.05))

... this gives me:

   groupA-groupB groupA-groupC groupD-groupC groupE-groupC groupD-groupE
-1            11            13            11            11            12
0             78            77            76            80            76
1             11            10            13             9            12
groupF-groupG groupH-groupI
-1             8            13
0             79            75
1             13            12

The output of topTable is also the same for each matching contrast between the two runs above. I can only assume that your data or design matrix is changing between the two runs. Incidentally, your design matrix is being formed from targets\$Group2, not Group2. This is probably not what you intended (though it's probably not the cause of your problem either).

0
Entering edit mode
@james-w-macdonald-5106
Last seen 2 hours ago
United States

Possibly orthogonal to your question, but for methylation array analysis (well, methylation analysis in general), it is a better idea to use something like minfi or DMRcate, DSS, etc to find differentially methylated regions rather than relying on information from individual CpG probes.

The reason for this is that methylation appears to be a regional phenomenon, where you expect contiguous CpGs to be similarly methylated, and measures of CpG methylation tend to be rather noisy. If you borrow information from adjacent CpG probes to estimate the regional methylation status, you are likely to find things that are biologically meaningful (e.g., regions of consistently differently methylated CpGs) and ignore things that may not be (e.g., single CpGs that appear to be highly differentially methylated in a region where no other CpGs agree).

0
Entering edit mode
@annabellecongras-12119
Last seen 5.5 years ago

Hi,

You were right Aaron I might have done something wrong cause it's working fine now after removing all data and re-running my script line by line.. thanks anyway

James,I actually used minfi for the normalisation/filtering steps, limma for DMP analysis, and I'm now moving to DMRcate for DMR analysis. I might come back soon with questions about that!

Annabelle

1
Entering edit mode