Hello,

I'm analyzing DNA methylation microarray data from whole blood. My goal is to fit a linear model to test for differences between two groups, young and old individuals (meth ~ age.group, where age.group is categorical).

I know that blood cell type composition in my samples is almost completely confounded with age.group (the differences in cell type percentages between old and young are much larger than the differences within each group). Because of how SVA works (first regressing out my model and then finding covariates in the residual variation), my questions are:

1) does this imply that the algorithm won't detect/correct this confounding (or the majority of it)?

2) can the solution be to use a reference-based cell type correction algorithm (such as Houseman, 2012) to take into account this confounder and use SVA afterwards?

2.1) if this approach is correct, is it better to correct my DNA methylation values first and input the corrected values to SVA, or to input uncorrected values to SVA and, in the null model, incorporate the cell type compositions as covariates?

Thanks

Hi Papyrus,

I also encountered this problem but fortunately, in my dataset, the cell type compositions were not associated with my covariate of interest. If you have the cell type composition data, why not put it directly in your formula (meth ~ age.group + cellA + cellB + cellC + and so on)? Including the nuisance covariates in the linear model would be a better approach than correcting the values beforehand, as you would also take into account the uncertainty of the measurement for the composition.
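
A minimal sketch of that idea for a single CpG, written in Python with NumPy rather than R so that it is self-contained. Everything in it is an assumption made for illustration: the data are simulated, the names `cell_a` and `cell_b` are hypothetical stand-ins for estimated cell fractions, and here the fractions are deliberately generated independently of the age group (the favorable case described above).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40

# 0 = young, 1 = old (simulated design, not real data)
age_group = np.repeat([0.0, 1.0], n // 2)

# Hypothetical estimated cell fractions, generated independently of age group.
# With real fractions that sum to one, one cell type would have to be left out,
# otherwise the columns are perfectly collinear with the intercept.
cell_a = rng.normal(0.5, 0.05, n)
cell_b = rng.normal(0.2, 0.05, n)

# Simulated methylation for one CpG: a true age effect of 0.05 plus a cell_a effect
meth = 0.3 + 0.05 * age_group + 0.4 * cell_a + rng.normal(0, 0.01, n)

# Design matrix for meth ~ age.group + cellA + cellB (with intercept)
X = np.column_stack([np.ones(n), age_group, cell_a, cell_b])

# Ordinary least squares fit
coef, *_ = np.linalg.lstsq(X, meth, rcond=None)
age_effect = coef[1]
print(f"estimated age-group effect: {age_effect:.4f}")
```

Because the cell fractions are unrelated to the group in this simulation, the age-group coefficient is recovered close to the simulated value; the situation in the question, where they are strongly related, is the harder case.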

However, if your nuisance covariates and your covariate of interest are highly correlated, then I suspect that adjusting for one would also remove much of the variance attributable to the other, because of the collinearity.
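
To make the collinearity concern concrete, here is a rough sketch (again Python with NumPy, simulated data, hypothetical names) that compares the standard error of the age-group coefficient when a cell-type covariate is unrelated to the group versus when its between-group difference is much larger than its within-group spread. It assumes a known residual standard deviation `sigma` and uses the usual OLS formula `sigma^2 * (X'X)^-1`; it is not a statement about what SVA itself does.

```python
import numpy as np

def age_coef_se(cell, age_group, sigma=1.0):
    """Standard error of the age-group coefficient in
    meth ~ age.group + cell, for a given residual SD sigma."""
    X = np.column_stack([np.ones_like(age_group), age_group, cell])
    cov = sigma**2 * np.linalg.inv(X.T @ X)
    return float(np.sqrt(cov[1, 1]))

rng = np.random.default_rng(1)
n = 40
age_group = np.repeat([0.0, 1.0], n // 2)  # 0 = young, 1 = old
noise = rng.normal(0, 1, n)

# Covariate unrelated to the group
cell_independent = noise
# Covariate whose between-group difference dwarfs the within-group spread
cell_confounded = 10 * age_group + noise

se_indep = age_coef_se(cell_independent, age_group)
se_conf = age_coef_se(cell_confounded, age_group)
print(f"SE with independent covariate: {se_indep:.3f}")
print(f"SE with confounded covariate:  {se_conf:.3f}")
```

In this simulation the standard error of the age-group effect is several times larger in the confounded case, which is the loss of power you would expect when the two are hard to separate.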

Best, Mikhael