Question

Use of confounders in downstream analysis

Aileen Bahl ▴ 10

Dear all,

I am having trouble understanding exactly how to include confounders in my downstream analysis. Below is a short description of my analysis and the problem; I would be very happy if some of you could help me understand how to proceed:

I normalized 450k data and then used lmFit() to find differentially methylated CpGs. My design matrix looks like this: model.matrix(~Pair+FatPercentage+EstradiolLevel). So, basically I want to identify CpG sites that are associated with changes in estradiol levels. As I want to perform within-pair analysis of monozygotic twins I added pair information looking like c(1,1,2,3,2,3...). I also added the fat percentage as a confounder as we saw significant correlations with the first principal component of the data. Does this look right to you?
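One detail of the design above is worth checking. If Pair is stored as plain numbers such as c(1,1,2,3,2,3...), model.matrix() treats it as a single continuous covariate rather than as one level per twin pair. The sketch below uses invented toy values and only base R to show the difference; the data frame name pheno is an assumption, not something from the original analysis.

```r
# Hypothetical toy data: 3 twin pairs (6 samples); all values are made up.
pheno <- data.frame(
  Pair = c(1, 1, 2, 3, 2, 3),
  FatPercentage = c(22.1, 25.3, 30.0, 27.5, 28.4, 31.2),
  EstradiolLevel = c(0.21, 0.35, 0.18, 0.40, 0.27, 0.33)
)

# Numeric Pair: a single slope column, i.e. the pair number is
# treated as a continuous trend.
design_numeric <- model.matrix(~ Pair + FatPercentage + EstradiolLevel,
                               data = pheno)

# Factor Pair: one indicator column per pair (minus the reference pair),
# which is what a within-pair comparison needs.
pheno$Pair <- factor(pheno$Pair)
design <- model.matrix(~ Pair + FatPercentage + EstradiolLevel, data = pheno)

colnames(design_numeric)
colnames(design)
```

With the factor version passed to lmFit(), the estradiol effect would be the coefficient named EstradiolLevel, which can be requested from topTable() through its coef argument after eBayes().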

Now, after having identified significantly differentially methylated CpGs, we want to use the GSA package and look at correlations between methylation and expression data. For GSA the pairs can be specified directly in the function call. Does that also work with continuous traits or only if you have two groups? Additionally, I am not really sure how to include confounders then. Do I have to use adjusted or unadjusted data? If I use adjusted data, would I use the same design matrix as above and not include pair information in the function call? Would that still be a within-pair comparison then? And for the adjustment itself, would it be something like adj.m <- normalizedM-fit$coef[,-1]%*%t(myDesign[,-1]) or do I also have to include the columns for pair and fat percentage in this adjustment somehow? If I don't have to use unadjusted data, how would I include information on fat percentage and the estradiol levels then?
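On the adjustment expression itself: with the design above, [,-1] drops only the intercept, so that expression subtracts the fitted pair, fat percentage and estradiol effects all at once. If the aim is to keep the estradiol signal and remove only the nuisance terms, one possible sketch is to subtract just those columns. The code below is base R on simulated data; the least-squares fit through qr.coef() stands in for fit$coef from lmFit(), and all values are invented.

```r
set.seed(1)
# Hypothetical toy data: 4 twin pairs, 5 CpGs; names mirror the question.
Pair <- factor(rep(1:4, each = 2))
FatPercentage <- rnorm(8, mean = 27, sd = 3)
EstradiolLevel <- rnorm(8, mean = 0.3, sd = 0.1)
myDesign <- model.matrix(~ Pair + FatPercentage + EstradiolLevel)
normalizedM <- matrix(rnorm(5 * 8), nrow = 5,
                      dimnames = list(paste0("cg", 1:5), NULL))

# Least-squares coefficients with CpGs in rows
# (the same layout as fit$coef from lmFit).
coefs <- t(qr.coef(qr(myDesign), t(normalizedM)))

# Subtract only the nuisance terms (pair and fat percentage);
# the intercept and the estradiol effect stay in the data.
nuisance <- grep("^Pair|^FatPercentage$", colnames(myDesign))
adj.m <- normalizedM -
  coefs[, nuisance, drop = FALSE] %*% t(myDesign[, nuisance, drop = FALSE])
```

limma's removeBatchEffect() performs a comparable subtraction, taking the pair factor as batch, fat percentage as covariates, and the term to protect as design. Data adjusted this way are probably best kept for plotting or correlation summaries; feeding them into another linear model would ignore the degrees of freedom already spent on the adjustment.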

Similarly, for the correlations between methylation and expression... Do I just use the adjusted data sets and then compute correlations over all individuals? Is that then still considering the within-pair changes? Or would I use delta betas for correlation analysis? In the latter case, would I use adjusted data? Would that then be like adjusting for pair twice if I use the design matrix from above? Or would I have to change the matrix and if yes, how?

One last thing: say I wanted to perform a differential analysis between two groups (not within-pair) but still have some twin pairs included in the analysis. Would I then use duplicateCorrelation() instead of including the pair information directly in the design matrix? Or, if that's not the right way to go, what should I do in that case?

Sorry for so many questions! I would really appreciate any kind of help or ideas so that I can understand how to go on...

Thanks a lot in advance and best regards,

Aileen

limma GSA confounding within-pair analysis 450k • 1.7k views

score 0 · Answer 1 · 2015-04-21

I would personally go a different route. Using data from individual CpGs is likely to be much less informative than using information from contiguous CpGs, based on the assumption that regions of CpGs tend to get differentially methylated, rather than individual CpGs. You can use bumphunter from within the minfi package to identify 'bumps' in your methylation data that correlate with estradiol level, after accounting for twins and fat percentage.

Once you have found significantly differentially methylated regions, I wouldn't use GSA. I am not sure how you would use GSA anyway - it is intended to compare lists of genes from two different experiments, not gene expression and methylation.

Instead, I think a more reasonable approach is to take a set of genes that are in CIS with the methylation region, and see if the expression of those genes correlates with methylation (e.g., fit a model with methylation as the explanatory variable, and gene expression as the dependent variable, along with whatever confounders you want to include). You could also look for genes that are TRANS as well, but I tend not to do that, because you run into multiplicity problems and it's biologically harder to sell anyway.
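As a rough illustration of the model described above, the following base R sketch regresses the expression of one gene on the methylation of one nearby region, with pair and fat percentage as confounders. Everything here is simulated, and the variable names (methylation, expression, dat) are placeholders rather than anything taken from the package mentioned below.

```r
set.seed(2)
# Hypothetical toy data for one region/gene pair; every value is simulated.
n <- 20
dat <- data.frame(
  Pair = factor(rep(1:10, each = 2)),
  FatPercentage = rnorm(n, mean = 27, sd = 3),
  methylation = rnorm(n)   # e.g. mean M-value over the region
)
dat$expression <- 5 - 0.8 * dat$methylation + rnorm(n, sd = 0.3)

# Expression is the dependent variable and methylation the explanatory
# variable, with pair and fat percentage included as confounders.
fit <- lm(expression ~ methylation + Pair + FatPercentage, data = dat)

# Estimate, standard error, t statistic and p-value for the methylation term.
meth_row <- coef(summary(fit))["methylation", ]
meth_row
```

Because Pair enters as a factor, the methylation coefficient reflects within-pair differences; repeating the fit over many region/gene pairs would call for a multiple-testing correction such as p.adjust().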

Anyway, I have a small package on GitHub that is intended to do this, which you can install using devtools:

library(devtools)
install_github("jmacdon/methylation")

I wrote the package for my personal use, so it's a bit rough around the edges, but feel free to try it out. If you do try it out and have questions, please contact me off-list: jmacdon at u.washington dot edu

score 0 · Answer 2 · 2015-04-21

Aileen Bahl ▴ 10

Hi,

thanks for your answer!

I already used bumphunter but it couldn't identify any significant bumps. So, we would still like to use GSA in order to see whether there are any pathways with higher or lower methylation across multiple genes in this pathway. This would be done separately for methylation and expression then. So, basically, just identifying all CpGs that belong to genes in the same pathway in the former case. However, the question remains how to include the variables from the design matrix here...

Your idea for the correlations is good. I will try to use your package and come back to you if necessary.

Thanks,

Aileen

But how do you define which gene a CpG 'belongs' to? That's why I correlate to all genes in CIS (where that is defined as a gene that is in a 1 Mb window, centered on the CpG region, but can be changed if you have different ideas of what CIS means).

ADD REPLY • link 9.0 years ago James W. MacDonald 65k
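A minimal base R sketch of the 1 Mb cis window described in the reply above might look like the following. The gene table, the coordinates, and the choice to measure distance to a transcription start site (tss) are all assumptions made for illustration; the reply does not say which gene coordinate is used.

```r
# Hypothetical coordinates; gene names and positions are invented.
genes <- data.frame(
  gene = c("geneA", "geneB", "geneC"),
  chr = c("chr1", "chr1", "chr2"),
  tss = c(1200000, 2500000, 1100000),
  stringsAsFactors = FALSE
)
region <- list(chr = "chr1", start = 1000000, end = 1001000)

window <- 1e6                               # total window width in bp
centre <- (region$start + region$end) / 2   # window is centred on the region

# A gene counts as cis if it is on the same chromosome and its TSS
# lies within half a window of the region centre.
in_cis <- genes$chr == region$chr & abs(genes$tss - centre) <= window / 2
cis_genes <- genes$gene[in_cis]
cis_genes
```

Changing window is then all it takes to try a different definition of cis. For genome-scale data, an overlap query on GenomicRanges objects would likely be a better fit than this vectorised comparison.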

score 0 · Answer 3 · 2015-04-21

I also define it according to the location of the CpG site. I assign every CpG that lies within a gene, or up to 1,500 bp upstream of it, to that gene (just as it is done in illuminaHumanMethylation450k.db). However, no matter how one defines the associations between CpGs and genes, the problem is the same: we still have to adjust for pair information and confounders (that is, define the actual methylation values for each CpG).