Limma for paired and unpaired analysis with adjustment for categorical and continuous variables
Entering edit mode
DSP • 0
Last seen 6.0 years ago

Dear Gordon and all,

I'm analyzing Illumina 450K methylation data with following study design:

Study design:  I have individuals at a given time-point from three different stages: Stage 1, 2 and 3 and I wish to identify differentially methylated CpGs in these groups. Stage 1 vs Stage 2; Stage 1 vs Stage 3 and Stage 2 vs Stage 3

I have preprocessed my data and adjusted for batch effects. Following that, I have done PCA and regression to identify potential covariates and found blood cell estimates account for variation explained by top 10 PCs.

### I've used the following code for unadjusted analysis:

Treat_outcome <- factor(Targets_Unapired$Outcome)
design_outcome_unadjusted <- model.matrix(~0+Treat_outcome)
colnames(design_outcome_unadjusted) <- levels(Treat_outcome)
cm_outcome_unadjusted <- makeContrasts(NGTvsIGT = NGT-IGT,T2DvsIGT = T2D-IGT, NGTvsT2D = T2D-NGT, levels = design_outcome_unadjusted) 
fit_outcome_unadjusted <- lmFit(Mval, design_outcome_unadjusted)
fit_outcome_final_unadjusted <-, cm_outcome_unadjusted)
fit_outcome_final_eBayes_unadjusted <- eBayes(fit_outcome_final_unadjusted)
NGTvsIGT_unadjusted_BH <- topTable(fit_outcome_final_eBayes,coef="NGTvsIGT",adjust="BH",number=100)
T2DvsIGT_unadjusted_BH <- topTable(fit_outcome_final_eBayes,coef="T2DvsIGT",adjust="BH",number=100)
NGTvsT2D_unadjusted_BH <- topTable(fit_outcome_final_eBayes,coef="NGTvsT2D",adjust="BH",number=100)

The above code worked fine. However when I try to introduce the cell estimates as covariates as in the following code, I get the following errors:

design_outcome_cellcounts <- model.matrix(~0+Treat_outcome+Targets_Unpaired$CD8T+Targets_Unpaired$CD4T+Targets_Unpaired$NK+Targets_Unpaired$Bcell+Targets_Unpaired$Mono+Targets_Unpaired$Gran)
cm_outcome_cellcounts <- makeContrasts(NGTvsIGT = NGT-IGT,T2DvsIGT = T2D-IGT, NGTvsT2D = T2D-NGT, levels = levels(Treat_outcome)) 
fit_outcome_cellcounts <- lmFit(Mval, design_outcome_cellcounts)
fit_outcome_final_cellcounts <-, cm_outcome_cellcounts)
fit_outcome_final_eBayes <- eBayes(fit_outcome_final_cellcounts)

## Error

Error in, cm_outcome_cellcounts) : 
  Number of rows of contrast matrix must match number of coefficients
In addition: Warning message:
In, cm_outcome_cellcounts) :
  row names of contrasts don't match col names of coefficients

Could you please suggest a way out?

Thanks in advance.




limma illumina450k lmfit • 1.7k views
Entering edit mode

Also I need to the same adjustments for another study design, where I have data from same individual at two timepoints. Baseline samples are all at stage 1 who either reach Stage 2 or Stage 3 at follow up timepoint.

I wish to capture differentially methylated CpGs resulting in transition from Stage 1 to Stage2 and Stage 1 to Stage 3 and look for any overlap.

As suggested by Aaron a while ago, I wish to go ahead with duplicate correlation function for the same. But I need to adjust these results for cell counts.

Entering edit mode

As a side note: Since you are analyzing methylation M-values, I recommend running eBayes with trend=TRUE. Due to the logistic transformation, extreme M-values tend to have higher variance than M-values near the middle of the range, so a plot of variance vs mean for all probes will look somewhat U-shaped. Running eBayes with trend=TRUE will model this mean-variance dependence  and should give you better results.

Entering edit mode
Last seen 3 months ago
Scripps Research, La Jolla, CA

The levels argument to makeContrasts should be your design matrix, design_outcome_cellcounts.

Entering edit mode

Dear Dr Thompson,

Thanks for correcting the code, it seems to be working fine now.

I was wondering what would be a better strategy to adjust for batch effects
1) Apply combat and use batch corrected values for differential methylation analysis 
2) Use batch variables in the design matrix

My data seems to have strong batch effects, nevertheless when I apply combat, I observe that the variation due to objects of interest like sample class/timepoint/outcome is also being nullified. 

Could you suggest a way out?

Thank you.

Entering edit mode

It's almost always preferable to model the batch effects in the design matrix rather than try to subtract the batch effects from the data. The only time you would do the latter is if you are applying a method that cannot handle batch effects at all, such as many clustering or machine learning algorithms.

Entering edit mode

Dear Dr Thompson.

Thanks for the suggestion.

I have done analysis on my paired data (pre-post comparison) using duplicate correlation function and modelling batch effects in the design matrix. I have done sequential adjustment for covariates:

1) Unadjusted data

2) Adjusted for batch effects:chip and position on chip

3) Adjusted for batch effects and cell counts

I have the following doubts:

1) When I adjust for chip and position on chip, I have an error "Coefficients not estimable: Treat_p_PositionR06C02 "

After QC, there are very few samples which come from Position R06C02 on the chip. Is the lack of enough data points from that factor a reason for coefficients not being estimable? Will it affect the results in any way?

2) When I try to adjust my data for cell counts, I get the following warning message: 

In, dy, coef.start = start, tol = tol, maxit = maxit,  :Too much damping - convergence tolerance not achievable

Nevertheless, the contrasts are being calculated and I get the results. However, there are no statistically significant DMCpGs after including cell counts in the model, unlike after adjustment for just batch effects.

Is it of some biological relevance or could it be because of the warning message?

The reason I wish to adjust for cell counts is because they are known to affect DNA methylation estimates. Will adjusting the data for cell counts prior to differential analysis be a better alternative than including them in the model matrix, since there are no differences in cell counts from pre-timepoint to post-timepoint?

Thank you.

Entering edit mode
  1. Yes, if all the samples in R06C02 belong to the same treatment group, then you cannot separate the effect of that chip position separately from the effect of that treatment group. This results in non-estimable coefficients. You should avoid models with non-estimable coefficients, as these will generally not give results with any meaningful interpretation. I can't tell you what the best way is to model the batch effects in your model. You should consult an experienced statistician at your workplace.
  2. I can't tell you what that warning means when I don't know what code you are using to adjust for cell counts, nor do I know where you got the cell counts. Adjusting methylation data for cell type composition is a complicated proposition. In principle, the cell counts should be linearly related to the beta values and therefore not linearly related to the M-values, so including cell counts as covariates in a linear model fit of M-values is suboptimal and might cause problems  In theory, you could first adjust the beta values for cell type composition and then transform the betas to M-values and proceed with modeling. However, I have not been able to make this approach work in practice, and I'm not sure whether it's a problem with the approach or with my data, so I don't know if I can recommend this approach in general.

A more generic approach would be to forget about the cell counts, chips, and chip positions, and instead use the sva package to estimate surrogate variables that account for any systematic variation in the residuals of your model. Unlike the cell counts, the surrogate variables should be linearly related to the M-values. And unlike the chip and position-on-chip variables, the surrogate variables should not result in non-estimable coefficients. Again, though, depending on the nature of your dataset, sva may or may not be appropriate, and if you're not sure, you should try to consult with a local statistician.

Entering edit mode

Thank you so much for the details. I've estimated cell counts using estimateCellCounts function from minfi:

cell_counts <- estimateCellCounts(RGSet.raw, compositeCellType = "Blood",processMethod = "preprocessNoob", probeSelect = "auto",cellTypes = c("CD8T","CD4T", "NK","Bcell","Mono","Gran"),returnAll = TRUE, meanPlot = TRUE, verbose = TRUE)

I've used the following code for DMA:

##  Loading covariates

Treat_p_outcome <- factor(Targets_Paired$Sample_Outcome_Details)
Treat_p_Chip <- factor(Targets_Paired$Sentrix_ID_coded)
Treat_p_Position <- factor(Targets_Paired$Sentrix_Position)
Treat_p_CD8T <- Targets_Paired$CD8T
Treat_p_CD4T <- Targets_Paired$CD4T
Treat_p_NK <- Targets_Paired$NK
Treat_p_Bcell <- Targets_Paired$Bcell
Treat_p_Mono <- Targets_Paired$Mono
Treat_p_Gran <- Targets_Paired$Gran

## Adjusted_paired_for batch effects and cell counts:

design_paired_outcome_adjusted_batch_cellcounts <- model.matrix(~0+Treat_p_outcome+Treat_p_Chip+Treat_p_Position+Treat_p_CD8T+Treat_p_CD4T+Treat_p_NK+Treat_p_Bcell+Treat_p_Mono+Treat_p_Gran)
cm_paired_adjusted_batch_cellcounts <- makeContrasts(NGTvsIGT = Treat_p_outcomeNGT-Treat_p_outcomeIGT_NGT,T2DvsIGT = Treat_p_outcomeT2D-Treat_p_outcomeIGT_T2D,T2DvsNGT = (Treat_p_outcomeT2D-Treat_p_outcomeIGT_T2D)-(Treat_p_outcomeNGT-Treat_p_outcomeIGT_NGT),levels=design_paired_outcome_adjusted_batch_cellcounts)
corfit_paired_adjusted_batch_cellcounts <- duplicateCorrelation(M_val, design_paired_outcome_adjusted_batch_cellcounts, block=Targets_Paired$SampleID)

## Input the inter-subject correlation into the linear model fit:

fit_paired_adjusted_batch_cellcounts <- lmFit(M_val, design_paired_outcome_adjusted_batch_cellcounts, block=Targets_Paired$SampleID, correlation=corfit_paired_adjusted_batch_cellcounts$consensus)
fit_paired_adjusted_batch_cellcounts_final <-,cm_paired_adjusted_batch_cellcounts)
fit_paired_adjusted_batch_cellcounts_final_eBayes <- eBayes(fit_paired_adjusted_batch_cellcounts_final)

The error comes when I'm trying to use duplicate correlation function.I'm relatively new to handling high-throughput data, still got to figure out the best way to go about! I would try adjusting beta values for cell type composition and alternately use SVA as you've suggested.

Thanks again!



Login before adding your answer.

Traffic: 885 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6