Dear all,

I have some problems in understanding how exactly to include confounders in my downstream analysis. I will provide a short description of my analysis and problem and I would be very happy if some of you could help me understanding how exactly to go ahead with that:

I normalized 450k data and then used lmFit() to find differentially methylated CpGs. My design matrix looks like this: model.matrix(~Pair+FatPercentage+EstradiolLevel). So, basically I want to identify CpG sites that are associated with changes in estradiol levels. As I want to perform within-pair analysis of monozygotic twins I added pair information looking like c(1,1,2,3,2,3...). I also added the fat percentage as a confounder as we saw significant correlations with the first principal component of the data. Does this look right to you?

Now, after having identified significantly differentially methylated CpGs, we want to use the GSA package and look at correlations between methylation and expression data. For GSA the pairs can be specified directly in the function call. Does that also work with continuous traits or only if you have to groups? Additionally, I am not really sure how to include confounders then. Do I have to use adjusted or unadjusted data? If I use adjusted data, would I use the same design matrix as above and not include pair information in the function call? Would that be still a within-pair comparison then? And for the adjustment itself, would it be something like adj.m <- normalizedM-fit$coef[,-1]%*%t(myDesign[,-1]) or do I also have to include the columns for pair and fat percentage in this adjustment somehow? If I don't have to use unadjusted data, how would I include information on fat percentage and the estradiol levels then?

Similarly, for the correlations between methylation and expression... Do I just use the adjusted data sets and then compute correlations over all individuals? Is that then still considering the within-pair changes? Or would I use delta betas for correlation analysis? In the latter case, would I use adjusted data? Would that then be like adjusting for pair twice if I use the design matrix from above? Or would I have to change the matrix and if yes, how?

One last thing - say I wanted to perform differential analysis between two groups (not within-pair) but still have some twin pairs included in the analysis, would I then used duplicateCorrelation() instead of including the pair information directly in the design matrix? Or if that's not the right way to go, what should I do in that case?

Sorry for that many questions! However, I would really appreciate any kind of help or ideas, to be able to understand how to go on...

Thanks a lot in advance and best regards,

Aileen

But how do you define which gene a CpG 'belongs' to? That's why I correlate to all genes in

CIS(where that is defined as a gene that is in a 1 Mb window, centered on the CpG region, but can be changed if you have different ideas of whatCISmeans).