Adjustment of covariates in 450k data using Limma
Entering edit mode
AST ▴ 60
Last seen 19 months ago

Before finding DMPs from 450k data, it is suggested to adjust the data for various covariates like cell count etc.

From various forums and literatures, I have come to know there are two methods to adjust the covariates:

Method 1:

fit <- lm (outcome ~ exposure + covariate)

group <- factor(targets$status,levels=c("Control","Case"))
design <- model.matrix(~target$ly+target$mo+target$gr+group)
fit <- lmFit(Mval,design)
fit.e <- eBayes(fit)
top<-topTable(fit.e,coef=4, num=Inf)

Fitting the data by regressing outcome (Methylation data) with exposure (case-control status) while adjusting for covariates (cell count) -> using eBayes() to test if the parameter estimate for case/Control is different from zero, using the residuals -> using fitted values for finding DMPs.

The other way is:

Method 2:

fit1 <- lm (outcome ~ 0 + covariate)
Adj. Outcome <- Raw outcome – fit1
fit2<- lm (Adj. Outcome ~ exposure)

design <- model.matrix(~0+target$ly+target$mo+target$gr)
fit <- lmFit(Mval,design) <- fit$coefficients
mAdj <- as.matrix(Mval) - %*% t(design)
betaAdj <- m2beta(mAdj)

Making a null model (using covariates) -> subtracting the coefficients (regression of Null model and Raw outcome) from raw outcome -> Regressing the residuals with exposure ->converting Mval to beta-values and finding DMPs.

The question is what is the difference between the two methods, as the results I am getting from both are different.

Which method is correct one?

Also, why I am getting different results in Method-2  when I am using mAdj (min q-value of top hit = 0.5) and betaAdj (min q-value of top hit = 0.001) for finding DMPs.

limma covariates 450k • 3.0k views
Entering edit mode
Aaron Lun ★ 28k
Last seen 7 hours ago
The city by the bay

I don't work with methylation data, so I'll just give general comments. To me, your method 1 seems eminently more sensible. You fit one model that contains all factors of interest, and then you test for the effect of case against control to identify significant differences (assuming that the fourth coefficient represents the case/control log-fold change). Nice and simple, assuming the case/control comparison isn't confounded by the cell counts.

Your method 2 is strange from a theoretical and practical perspective. Firstly, removal of the uninteresting effects is not a free lunch; you need to estimate their coefficients, and if you treat the corrected M-values as the observed values, downstream linear modelling steps won't be able to account for the uncertainty with which those coefficients are estimated. This would probably result in some liberalness. Secondly, if you want to compute residuals, then I don't know why you're subsetting away the first two coefficients. Assuming that ly, mo and gr are real-valued covariates, your mAdj calculation would be equivalent to only removing the effect of gr on the observed values. I've no idea whether that's a sensible thing to do; but if you want to compute residuals, you should be using all your coefficients.

Finally, as far as I know, there's no m2beta function in limma. But if you're choosing between using M-values or beta-values in your linear model, it seems that M-values are generally more appropriate in the methylation context.

Entering edit mode

I am really sorry, I just mixed up two different analysis codes. Actually, in this case I am not subsetting any coefficient.

But the question about Method-2 is why can't we find the best fit model for all the covariates and subtract the resulting coefficient from raw methylation values and then use the residuals for subsequent analysis.

Entering edit mode

Method 2 is not statistically rigorous. You're treating the residuals as observed values when you use them for downstream analysis. However, their values depend on the estimated coefficients. If you don't model the uncertainty in coefficient estimation, you'll underestimate the variability in the residual values, and this will result in loss of type I error control later on.

Perhaps this simple simulation will convince you. Here I've got 5 batches with 2 samples in each batch. For est1, I estimate the variance directly from the linear model. In contrast, for est2, I estimate the variance of the residuals.

est1 <- est2 <- list()
for (it in 1:1000){
    y <- rnorm(10) # True variance of 1
    batch <- factor(rep(1:5, each=2))
    fit <- lm(y~batch)
    est1[[it]] <- summary(fit)$sigma^2
    est2[[it]] <- var(residuals(fit))
mean(unlist(est1)) # gives me something close to 1
mean(unlist(est2)) # gives me something close to 0.55

So you can see that the latter approach will systematically underestimate the true variance. This is because it thinks that we have 9 residual degrees of freedom, whereas we only actually have 5 (one per batch in the linear model).

Now, in some cases, removing uninteresting effects beforehand is necessary if your downstream procedures can't handle them, e.g., when making heatmaps, PCA plots. This is probably okay, as underestimation of the variance isn't a big deal for visualization. However, if you want to do hypothesis testing, you need to be more quantitative, and that includes accounting for estimation uncertainty for the uninteresting coefficients. See ?removeBatchEffect for similar sentiments.


Login before adding your answer.

Traffic: 483 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6