Before finding DMPs from 450k data, it is suggested to adjust the data for various covariates, such as cell counts.
From various forums and the literature, I have come to know that there are two methods to adjust for covariates:
Method 1:
# Conceptually: fit <- lm(outcome ~ exposure + covariates)
group <- factor(targets$status, levels=c("Control","Case"))
design <- model.matrix(~targets$ly + targets$mo + targets$gr + group)
fit <- lmFit(Mval, design)
fit.e <- eBayes(fit)
top <- topTable(fit.e, coef=5, number=Inf)  # the group (case/control) effect is the last design column
Fitting the model by regressing the outcome (methylation M-values) on the exposure (case/control status) while adjusting for the covariates (cell counts) -> using eBayes() to test whether the case/control coefficient is different from zero, with moderation based on the residual variances -> ranking the probes with topTable() to find DMPs.
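(As a sanity check on which coefficient to test, one can list the design columns; the labels below are what model.matrix() would produce for the design above, so the case/control effect is the last column:)

colnames(design)
## [1] "(Intercept)" "targets$ly"  "targets$mo"  "targets$gr"  "groupCase"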
The other way is:
Method 2:
# Conceptually:
#   fit1 <- lm(outcome ~ 0 + covariates)
#   adj.outcome <- raw outcome - fitted(fit1)
#   fit2 <- lm(adj.outcome ~ exposure)
design <- model.matrix(~0 + targets$ly + targets$mo + targets$gr)
fit <- lmFit(Mval, design)
mAdj.fit <- fit$coefficients
mAdj <- as.matrix(Mval) - mAdj.fit %*% t(design)
betaAdj <- m2beta(mAdj)
Fitting a null model using only the covariates -> subtracting the fitted covariate effects (the null-model coefficients multiplied through the design) from the raw outcome -> regressing the residuals on the exposure -> converting the adjusted M-values to beta-values and finding DMPs.
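Here m2beta() is the standard back-transformation from M-values to beta-values (available, e.g., in the lumi package); an equivalent one-liner, for reference:

m2beta <- function(m) 2^m / (2^m + 1)  # inverse of M = log2(beta/(1 - beta))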
The question is: what is the difference between the two methods? The results I am getting from the two are different. Which method is the correct one?
Also, why am I getting different results in Method 2 depending on whether I use mAdj (min. q-value of top hit = 0.5) or betaAdj (min. q-value of top hit = 0.001) for finding DMPs?
I am really sorry, I just mixed up two different analysis scripts. Actually, in this case I am not subsetting any coefficient.
But my question about Method 2 remains: why can't we fit the best model for all the covariates, subtract the fitted covariate effects from the raw methylation values, and then use the residuals for the subsequent analysis?
Method 2 is not statistically rigorous. You're treating the residuals as observed values when you use them for downstream analysis. However, their values depend on the estimated coefficients. If you don't model the uncertainty in coefficient estimation, you'll underestimate the variability in the residual values, and this will result in loss of type I error control later on.
Perhaps this simple simulation will convince you. Here I've got 5 batches with 2 samples in each batch. For est1, I estimate the variance directly from the linear model. In contrast, for est2, I estimate the variance of the residuals.
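Something along these lines, assuming unit-variance noise and arbitrary batch means (the exact values are illustrative):

set.seed(100)
batch <- factor(rep(1:5, each=2))     # 5 batches, 2 samples each
est1 <- est2 <- numeric(10000)
for (i in seq_along(est1)) {
    y <- rnorm(5)[batch] + rnorm(10)  # batch means plus noise with true variance 1
    fit <- lm(y ~ batch)
    est1[i] <- summary(fit)$sigma^2   # variance from the linear model: RSS/5
    est2[i] <- var(residuals(fit))    # naive variance of the residuals: RSS/9
}
mean(est1)  # close to the true value of 1
mean(est2)  # about 5/9, biased downwards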
So you can see that the latter approach will systematically underestimate the true variance. This is because it thinks that we have 9 residual degrees of freedom, whereas we only actually have 5 (one per batch in the linear model).
Now, in some cases, removing uninteresting effects beforehand is necessary if your downstream procedures can't handle them, e.g., when making heatmaps or PCA plots. This is probably okay, as underestimation of the variance isn't a big deal for visualization. However, if you want to do hypothesis testing, you need to be more quantitative, and that includes accounting for estimation uncertainty for the uninteresting coefficients. See ?removeBatchEffect for similar sentiments.
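If you do go the visualization route, a sketch with limma's removeBatchEffect(), reusing the covariates and group factor from the question (for plotting only, not for inference):

library(limma)
design <- model.matrix(~group)                     # protect the case/control effect
covs <- cbind(targets$ly, targets$mo, targets$gr)  # uninteresting cell-count covariates
Mval.vis <- removeBatchEffect(Mval, covariates=covs, design=design)
# use Mval.vis for heatmaps and PCA; run lmFit()/eBayes() on the original Mval with the full design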