Question

How to get normalized values for methylation counts in edgeR

mico • 0

Hi, I have a single-cell methylation dataset. I wonder how I can get appropriate values from the edgeR glmFit object for downstream clustering. This is what I've run:

design0 <- model.matrix(~0 + sample + celltype, data=file.df)
design <- modelMatrixMeth(design0)

y <- estimateDisp(y, design=design, trend="none")
fit1 <- glmQLFit(y, design, robust=TRUE)

Then I obtained celltype-specific sites using glmQLFTest(fit1, contrast=contr) for each celltype (contr is a one-vs-others contrast, as in the single-cell pseudobulk analysis in the edgeR user guide).

I then extracted the significant celltype-specific sites and their logFC values and drew a heatmap. However, the plot doesn't show a clear pattern. It seems that celltype was not regressed out of the values very well, and the logFCs for some statistically significant sites with sparse counts are not shrunk. Is there an alternative metric for visualization? Thanks.

edgeR MethylationArray • 1.2k views

23 months ago • mico

score 1 · Accepted Answer · 2024-02-07

I have no experience with single-cell BS-seq. Nevertheless, to export methylation values from edgeR, I would probably use M-values as explained in the methylation workflow: https://bioinf.wehi.edu.au/edgeR/F1000Research2017/

I'm not sure what you mean by getting values from glmFit, because output from glmFit is not used for clustering. The linear model fit already has the structure imposed by the linear model, so clustering it to find a different structure doesn't make sense. Clustering would only make sense on the original data, e.g., the M-values.
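As a rough illustration of clustering on M-values rather than on the model fit, here is a minimal base-R sketch. It assumes a count matrix with alternating methylated (Me) and unmethylated (Un) columns per sample and an offset of 2 added to each count before taking the log-ratio; the toy counts and sample names are made up for the example.

```r
# Toy counts: 4 loci (rows) x 3 samples, each sample contributing a
# methylated (Me) and an unmethylated (Un) column. Values are invented.
counts <- matrix(
  c(10, 2, 12,  1,  9, 3,
     1, 8,  0, 11,  2, 9,
    20, 5, 18,  4, 22, 6,
     0, 0,  3,  7,  1, 5),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    paste0("locus", 1:4),
    paste0(rep(c("S1", "S2", "S3"), each = 2), c("_Me", "_Un"))
  )
)

# Label each column as methylated or unmethylated.
methylation <- rep(c("Me", "Un"), times = 3)
Me <- counts[, methylation == "Me"]
Un <- counts[, methylation == "Un"]

# M-value: log2 ratio of methylated to unmethylated counts, with a small
# offset (2 here, an assumption) so that zero counts stay finite.
M <- log2(Me + 2) - log2(Un + 2)
colnames(M) <- c("S1", "S2", "S3")

# Cluster samples on the M-values (the data), not on fitted coefficients.
hc <- hclust(dist(t(M)))
```

The matrix `M` can then be passed to whatever heatmap or clustering function is already in use; loci with no coverage in a sample get an M-value of 0 under this offset.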

logFC values from edgeR are shrunk, and the shrinkage can be increased by setting a larger prior.count. For heatmap purposes we usually use prior.count=2.
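To see why a larger prior count matters most for sparse counts, here is a simplified base-R sketch. It is not edgeR's actual computation (edgeR estimates logFC inside the GLM and scales the prior count by library size); the helper `log_ratio` is hypothetical and only shows the general effect of adding a prior count to both sides of a ratio.

```r
# Hypothetical helper: log2 ratio of two counts after adding a prior count
# to each. This is a simplification, not edgeR's internal method.
log_ratio <- function(a, b, prior.count) {
  log2((a + prior.count) / (b + prior.count))
}

priors <- c(small = 0.125, large = 2)

# Sparse site: 3 reads versus 0 reads.
sparse_lfc <- log_ratio(3, 0, priors)
# small prior: log2(3.125 / 0.125) = log2(25), about 4.64
# large prior: log2(5 / 2), about 1.32

# Well-covered site: 300 reads versus 100 reads.
deep_lfc <- log_ratio(300, 100, priors)
# Both priors give values close to log2(3); the prior barely matters here.

print(rbind(sparse = sparse_lfc, deep = deep_lfc))
```

The sparse site's log-ratio is pulled strongly towards zero by the larger prior, while the well-covered site is almost unchanged, which is the behaviour one usually wants in a heatmap.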

Later thoughts

Are you analysing the data at the single-cell level or are you somehow forming pseudo-bulk data? I have no experience with pseudo-bulk for BS-seq. I'm a bit worried that simply summing up the counts as for pseudo-bulk RNA-seq might not work well, because the pairing of the methylated and unmethylated counts for each cell and locus would get lost.
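One possible reading of this concern, shown as a toy base-R example with invented numbers: two cells at the same locus, one fully methylated and one fully unmethylated, look like a single half-methylated sample once their counts are summed.

```r
# Invented counts for one locus in two cells of the same cell type.
cells <- data.frame(
  cell = c("c1", "c2"),
  Me   = c(10, 0),   # methylated reads
  Un   = c(0, 10)    # unmethylated reads
)

# Per-cell methylation proportion keeps the Me/Un pairing within each cell.
per_cell <- cells$Me / (cells$Me + cells$Un)

# Pseudo-bulk style pooling: sum Me and Un across cells first.
pooled <- sum(cells$Me) / (sum(cells$Me) + sum(cells$Un))

print(per_cell)  # 1 0
print(pooled)    # 0.5
```

The pooled proportion of 0.5 hides the fact that each individual cell was either fully methylated or fully unmethylated at this locus. Whether this matters in practice depends on the data and is not something this sketch can settle.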