How to get normalized values for methylation counts in edgeR
mico • 0
United States

Hi, I have a single-cell methylation dataset. How can I get appropriate values from the edgeR GLM fit object for downstream clustering? This is what I've run:

design0 <- model.matrix(~0 + sample + celltype, data=file.df)
design <- modelMatrixMeth(design0)

y <- estimateDisp(y, design=design, trend="none")
fit1 <- glmQLFit(y, design, robust=TRUE)

I then identified cell type-specific sites using glmQLFTest(fit1, contrast=contr) for each cell type (contr is a one-versus-others contrast, as in the single-cell pseudobulk analysis in the edgeR User's Guide).

I then extracted the significant cell type-specific sites and their logFC values and drew a heatmap, but the plot doesn't show a clear pattern. It seems the values were not adjusted for cell type very well, and for some statistically significant sites with sparse counts the logFC values are not shrunk. Is there an alternative metric for visualization? Thanks.

edgeR MethylationArray • 185 views
WEHI, Melbourne, Australia

I have no experience with single-cell BS-seq. Nevertheless, to export methylation values from edgeR, I would probably use M-values, as explained in the methylation workflow.

I'm not sure what you mean by getting values from glmFit, because the output from glmFit is not used for clustering. The fitted linear model already has the structure imposed by the model, so clustering it to find a different structure doesn't make sense. Clustering only makes sense on the original data, e.g., the M-values.
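As a minimal sketch of what clustering on M-values could look like, the snippet below computes M-values as the log2 ratio of methylated to unmethylated counts and clusters the samples. The matrices Me and Un and all counts in them are made up for illustration, and the prior count of 2 is an assumption borrowed from the heatmap advice in this thread, not a required setting.

```r
# Toy methylated (Me) and unmethylated (Un) counts: 4 sites x 3 samples.
# All values are invented for illustration.
Me <- matrix(c(10, 0, 5, 20,
               12, 1, 4, 18,
                2, 0, 9, 25), nrow = 4,
             dimnames = list(paste0("site", 1:4), paste0("s", 1:3)))
Un <- matrix(c( 2, 0, 5,  1,
                3, 0, 6,  2,
               11, 1, 1,  0), nrow = 4, dimnames = dimnames(Me))

# M-value: log2 ratio of methylated to unmethylated counts. A small prior
# count (assumed to be 2 here) is added to both to avoid taking log2(0)
# and to pull sparse sites towards zero.
prior <- 2
M <- log2(Me + prior) - log2(Un + prior)

# Hierarchical clustering of samples on the M-values
# (Euclidean distance, complete linkage: the hclust() defaults).
hc <- hclust(dist(t(M)))
```

Calling plot(hc), or passing M to a heatmap function, then shows the structure in the data themselves rather than in the fitted model.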

logFC values from edgeR are shrunk and the shrinkage can be increased by setting prior.count. For heatmap purposes we usually use prior.count=2.
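To see why a larger prior count mainly affects sparse sites, here is a simplified illustration in base R. It is not edgeR's exact calculation (edgeR scales the prior count by library size); the helper shrunk_log_ratio and the counts are hypothetical.

```r
# Simplified illustration only: edgeR scales the prior count by library
# size, so this is not its exact calculation.
shrunk_log_ratio <- function(a, b, prior.count) {
  log2((a + prior.count) / (b + prior.count))
}

# A sparse site (3 vs 0 counts) and a well-covered site (300 vs 100 counts),
# evaluated at three prior counts.
priors <- c(0.125, 2, 4)
sparse <- sapply(priors, function(p) shrunk_log_ratio(3, 0, p))
dense  <- sapply(priors, function(p) shrunk_log_ratio(300, 100, p))

# Rows are sites, columns are the prior counts.
res <- rbind(sparse, dense)
colnames(res) <- paste0("prior=", priors)
round(res, 2)
```

The log-ratio for the sparse site drops sharply as the prior count grows, while the well-covered site barely moves.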

Later thoughts

Are you analysing the data at the single-cell level or are you somehow forming pseudo-bulk data? I have no experience with pseudo-bulk for BS-seq. I'm a bit worried that simply summing the counts, as for pseudo-bulk RNA-seq, might not work well because the pairing of the methylated and unmethylated counts for each cell and locus would be lost.
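To make that worry concrete, here is a toy sketch of pseudo-bulk summation, under one reading of the concern. It is not a recommended procedure; the matrices, cell labels and counts are all made up.

```r
# Toy per-cell counts: 3 sites x 4 cells (invented values).
Me <- matrix(c(1, 0, 2,  0, 1, 1,  3, 0, 0,  2, 2, 1), nrow = 3,
             dimnames = list(paste0("site", 1:3), paste0("cell", 1:4)))
Un <- matrix(c(0, 1, 0,  1, 0, 2,  0, 0, 1,  1, 0, 0), nrow = 3,
             dimnames = dimnames(Me))
celltype <- c("A", "A", "B", "B")

# Sum methylated and unmethylated counts separately within each cell type.
# rowsum() groups rows, so transpose to group the cells (columns).
Me_pb <- t(rowsum(t(Me), group = celltype))
Un_pb <- t(rowsum(t(Un), group = celltype))

# site1 in cell type A: cell1 is (Me=1, Un=0) and cell2 is (Me=0, Un=1).
# After summing, only the totals (Me=1, Un=1) remain.
c(Me = Me_pb["site1", "A"], Un = Un_pb["site1", "A"])
```

In this toy case one fully methylated cell and one fully unmethylated cell pool to a site that looks half methylated, and the per-cell pairing can no longer be recovered from the summed counts.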

Thank you for your response! Yes, clustering on M-values showed that most of the variation comes from samples and cell types.

To get the sites, I first used pseudobulk data to call all possible sites, then partitioned the counts into individual cell types. Some clusters have many cells whereas others have only dozens, so counts at many sites in the small clusters are inevitably sparse. I ended up using prior.count=4, and the logFC values are indeed shrunk more than before. The thing is, I expected to see different modules that are highly active in particular cell types, like distinct "blocks" popping out in the heatmap, but that's just not happening with my data. I guess it could just be how this dataset is.

