Using surrogate variables as covariates in WGBS differential methylation calling
1
0
Entering edit mode
gjl • 0
@gjl-21007
Last seen 2.1 years ago

I have WGBS data of a normal (n=6) and disease-state (n=6) tissue. I believe that the data needs to be corrected for confounding variables, since it was processed in multiple batches. I don't expect many large differences between the normal and disease-state either, so SVA seems to be the best available correction method since it will remove unknown confounders as well.

However, SVA was designed for normally distributed data, and WGBS data is far from normally distributed. I can't find any literature on using SVA on WGBS, so I'm not sure if it would be appropriate. SVA has been performed on M-values from methylation chip data though.

So I guess my question is: Are there any papers/software that perform SVA on WGBS data and use the surrogate variables in some type of linear model? I know DSS.general and dmrseq allow for covariates to be used in their models, but again, I am not sure if using surrogate variables on ratio data is appropriate.

Possible routes for analysis:

Plan 1 (if possible): Use DSS.general and/or dmrseq and/or any other software with surrogate variables as covariates.
Pros: uses count-based statistics and removes all confounding effects

Plan 2: Use DSS.general and/or dmrseq (I believe that the modeling in these packages are similar, but they use different approaches to find FDR) with known confounders (like batch and age) as covariates.
Pros: uses count-based statistics Cons: can't remove the effects of unknown confounders

Plan 3: Smooth data and obtain methylation levels for high-coverage CpGs (BSmooth). Transform to M-values and use limma (with surrogate variables from M-values as covariates) to find differentially methylated loci. Pros: can implement SVA Cons: can't use count-based statistics; results from M-value analysis are not always biologically relevant

dmrseq sva DSS • 335 views
0
Entering edit mode
keegan ▴ 60
@keegan-11987
Last seen 4 months ago

Hi,

Great question. I'm not aware of any publications that perform SVA on WGBS data to adjust for batch effects. However, in principle, you could do so following the same principles of microarray, RNAseq, etc data. While SVA may have originally been developed for normally distributed data, I am not aware that it is a requirement. And SVA has been applied to appropriately transformed non-normal data (see 10.1093/nar/gku864 for example). You are correct to point out that applying SVA on methylation counts/proportions directly would not directly translate to adjusting in the models of dmrseq or dss. However, if you'd like to go with plan 1, using dmrseq for example, I'd recommend applying SVA to the arcsine-link transformed methylation proportions, since this is the transform used in the modeling of DMRs. Using example data from the dmrseq package, this could look like:

library(dmrseq)
library(sva)

data(BS.chr21)
bs <- BS.chr21[240001:290000,]
meth.mat <- as.matrix(asin(2*getMeth(bs, type = "raw")-1))
m1 <- model.matrix(~pData(bs)$CellType) m0 <- cbind(m1[,1]) svseq = sva(meth.mat,m1,m0,n.sv=1) pData(bs)$sv <- svseq\$sv

regions <- dmrseq(bs, cutoff = 0.05,
testCovariate = "CellType",

The downside here is that the SVA estimation does not fully take advantage of the count information (e.g. a methylation value with coverage of 1 is treated the same as the same value with coverage of 20). However, you would still be able to take advantage of the advanced modeling in dmrseq or DSS with this type of approach (as opposed to using limma in your plan 3).

I'd recommend comparing your results to the adjustment with known batches (your plan 2), and I wouldn't recommend adjusting for more than 1 SV, since your sample size is rather low. You are also correct in that dmrseq is similar to DSS, except that dmrseq is more focused on regions, and as such uses a different approach in order to accurately control error rate of DMR discovery.

Best, Keegan