strange results from differential methylation analysis using limma
2
0
Entering edit mode
RC • 0
@rc-9372
Last seen 2.9 years ago
Beijing

I want to identify differentially methylated probes(DMPs) between two groups.

After preprocessed raw data, 666805 probes remained. Differential methylation analysis was performed on the remain probes using limma. Age and gender information were used as covariates. Probes with BH adjusted P value<0.05 were considered as DMPs.

As a result, 134841 probes were identified as DMPs. It reveals nearly 1/5 probes was differentially methylated between our two groups. btw :the chip we used is Illumina Infinium MethylationEPIC BeadChip. It is kind of strange because samples from both groups were healthy people. They just grouped by different living environment.

So, we asked one technician to perform differential methylation analysis on the same 666805 probes. As a result, 8388 probes observed as DMPs,using IMA. I compared those two result, only 3037 probes were the same.

It is so strange that the numbers of two results is extremely different. I am newcomer to this research field, and i did not figure out the errors in my code. So i paste the code for DMP detection here and hope someone can help me.

library(limma)
mval<-log2(betaval/(1-betaval))

var<-model.matrix(~Age + as.factor(Gender) + as.factor(Group),data=targets)
colnames(var)<-c("control","age","gender","group")
fit<-lmFit(mval,var)
fit2<-eBayes(fit)
write.table(sig.probe,file="Result.DMP.txt",sep="\t",quote=TRUE)

limma methylation methyanalysis IMA • 2.4k views
0
Entering edit mode

Assuming the "data_afterpreprocessed.txt" file contains beta-values of normalised samples, in the same order as the samples in the samplesheet file, I can't see anything terribly wrong with the code. Perhaps if you do an plotMD(fit2, column=4) or volcanoplot(fit2, coef=4) you might see what's driving so much differential methylation.  Or pick a probe that IMA fails to report as significant, and examine its mval profile and the coefficients and residuals from limma.

0
Entering edit mode

Could you please help me how to make Methylation data input file like should I take CpG probes as rows or column? Many thanks

0
Entering edit mode

You should ask a new question, but the short answer is that features such as CpG probes are in rows.

1
Entering edit mode
Aaron Lun ★ 27k
@alun
Last seen 9 hours ago
The city by the bay

Your limma code looks fine, assuming Gender and Group only have two levels and Age is a real-valued covariate. The only extra things would be to set trend=TRUE and robust=TRUE in eBayes, but I don't think it would make much of a dent in your 130000 DMPs. If you think that number's too high, investigate whether there are confounding factors (e.g., batch) that might be driving systematic differences between groups.

I can't speak for why the limma results disagree with those from IMA, you'll have to ask the IMA authors. Also see comments about testing for differential methylation: C: What is logFC and "B" in methylation analysis ??.

0
Entering edit mode

Hi Aaron, thanks for your advice. Yes, the gender and group were two-level categorical variables and the Age is continuous variable. I have tried to add trend=TRUE and robust=TRUE option in  eBayes, but it failed to make a big difference. At this time, i used beta value instead of M values. Without the two extra options, i got 143346 DMPs. After i set the trend=TRUE and robust=TRUE in eBayes, i got 144154 DMPs.

About the batch effect, i am not sure i get it correctly. My data was generated from the chip in the same batch instead of collected/downloaded separately. So, i thought there may not exist a strong batch effect (Am i wrong？). Then i checked the corresponding location on the chip for each sample and i found out samples were classified in different "sentrix Barcode". PCA were used to testify the effect of "sentrix Barcode". The PCA plot of beta values revealed that samples were blend together, indicating it wasn't the confounding factors.

0
Entering edit mode

Well, there are other confounding factors. For example, are the people in both groups of the same age? Sex? Ethnic/genetic background? Did they have the same diet? Lifestyle? Human data can be riddled with batch effects and hidden factors of variation, because it's hard to keep people in cages for a properly controlled experiment.

If you're sure that these factors are not the cause, the only thing to do at this point would be to follow Gavin's suggestion and have a closer look at the DMPs. For example, if the DMPs are being driven by one or two samples in a particular group, then perhaps you should take a closer look at those samples and see if they warrant special consideration. If all the DMP log-fold changes are in the same direction, perhaps there was a problem during normalization and pre-processing of the data.

0
Entering edit mode

Well, there is no significant difference in age between  the two groups. So is the sex ratio. And all samples share the same ethnic background. You know, it's hard to control their diet and lifestyle. Considering the same ethnic background, diet or lifestyle is unlikely to be the confounding factor with a strong impact.  I rechecked each information of samples(including: age, sex, education, ethnic group), no significant difference were found except their living environment, which is also our target factor.

I am following Gavin's suggestion and check the probes that IMA fails to report as significant.

0
Entering edit mode
@gordon-smyth
Last seen 5 hours ago
WEHI, Melbourne, Australia

It is not entirely clear that there is any problem here.

IMA uses limma as the testing method in their vignette:

however the default in IMA is to do a Wilcoxon test for each site. Obviously a Wilcoxon test will be far less powerful than the limma test when you don't have a lot of replicates and will result in fewer DMPs. If you tell your "technician" to use the "limma" option when running IMA, you may find that it gives the same result as your limma analysis. If not, then the pre-processing must be different in some way.

Have you considered trying missMethyl (which also uses limma under the hood):

0
Entering edit mode

Hi Gordon.  sorry i forget to mention that the "technician" using testmethod = "pooled" in IMA. And we used the same preprocessed data(666805 probes) to find DMPs, just using different methods. So, the cause of difference must in the "Find DMP" step.

We will try to use the "limma" option in IMA, any result will be updated here. Thanks for your suggestions.