Dear all,
This is just a follow-up question to an old post about RRBS analysis using voomLmFit with block. I am facing an issue where the intra-block correlation is consistently set to zero, implying that there is no need to account for blocking, when running voomLmFit for RRBS analysis with a block factor (i.e., patient).
# Experimental design
meta <- data.frame(library = as.factor(1:12),
                   patient = as.factor(rep(1:6, each = 2)),
                   disease = rep(c("healthy", "sick"), each = 6),
                   tissue = rep(c("liver", "muscle"), times = 6))
meta$group <- paste(meta$disease, meta$tissue, sep = ".")
# Toy data
set.seed(123)
counts <- matrix(data = round(runif(n = 24000, min = 0, max = 100)),
                 ncol = nrow(meta) * 2)
cols <- expand.grid(c("Me","Un"), meta$library)
colnames(counts) <- paste(cols[,2], cols[,1], sep = "_")
rownames(counts) <- paste("CpG", 1:nrow(counts), sep = "_")
# voomLmFit
library(edgeR)
design <- modelMatrixMeth(~0 + group, meta)
colnames(design) <- gsub("group", "", colnames(design))
v <- voomLmFit(counts = counts,
               design = design,
               block = rep(meta$patient, each = 2),
               sample.weights = TRUE)
# First sample weights (min/max) 0.8816726/1.2959486
# First intra-block correlation 0
# Final sample weights (min/max) 0.8703724/1.3442574
# Final intra-block correlation 0
At first, I suspected that the simulated data in this example might have led to the zero intra-block correlation, but I observed the same behavior with real-life data. Running voomWithQualityWeights and duplicateCorrelation manually confirms the issue and produces the following warning message:
Block factor already encoded in the design matrix: setting intrablock correlation to zero.
Replicating what voomLmFit does internally, by running voomWithQualityWeights and duplicateCorrelation twice, confirmed this behavior:
vwqw <- voomWithQualityWeights(counts = counts,
                               design = design)
corfit <- duplicateCorrelation(object = vwqw,
                               design = design,
                               block = rep(meta$patient, each = 2))
# Warning message:
# In duplicateCorrelation(object = vwqw, design = design, block = rep(meta$patient, :
# Block factor already encoded in the design matrix: setting intrablock correlation to zero.
vwqw <- voomWithQualityWeights(counts = counts,
                               design = design,
                               block = rep(meta$patient, each = 2),
                               correlation = corfit$consensus.correlation)
corfit <- duplicateCorrelation(object = vwqw,
                               design = design,
                               block = rep(meta$patient, each = 2))
# Warning message:
# In duplicateCorrelation(object = vwqw, design = design, block = rep(meta$patient, :
# Block factor already encoded in the design matrix: setting intrablock correlation to zero.
In fact, the Sample (library) factor is nested within the block (patient) in the methylation design matrix, because of the way modelMatrixMeth constructs it. Therefore, the warning message makes sense.
head(design)
# Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7 Sample8 Sample9 Sample10 Sample11 Sample12 healthy.liver healthy.muscle
# 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0
# 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0
# 3 0 1 0 0 0 0 0 0 0 0 0 0 0 1
# 4 0 1 0 0 0 0 0 0 0 0 0 0 0 0
# 5 0 0 1 0 0 0 0 0 0 0 0 0 1 0
# 6 0 0 1 0 0 0 0 0 0 0 0 0 0 0
# sick.liver sick.muscle
# 1 0 0
# 2 0 0
# 3 0 0
# 4 0 0
# 5 0 0
# 6 0 0
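As a quick sanity check, the nesting can be verified numerically: the block indicators are fully explained by the Sample columns of the methylation design matrix, which is presumably what triggers the warning (this check is just for illustration and is not needed for the analysis itself):
blk <- rep(meta$patient, each = 2)  # one Me and one Un row per library
B <- model.matrix(~0 + blk)         # 24 x 6 block indicator matrix
fit <- lm.fit(x = design, y = B)    # regress the block indicators on the design
max(abs(fit$residuals))             # effectively 0: each patient block equals the sum of its two Sample columns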
Questions
Q1: Is there a correct approach to account for block (i.e., patient) in voomLmFit, given the warning about the block factor being encoded in the design matrix? For example, would it make sense to use a design from model.matrix instead of modelMatrixMeth when running duplicateCorrelation?
design_normal <- model.matrix(~0 + group, meta)
design_normal <- design_normal[rep(seq_len(nrow(design_normal)), each = 2), ]  # replicate each row twice
design_methyl <- modelMatrixMeth(~0 + group, meta)
vwqw <- voomWithQualityWeights(counts = counts,
                               design = design_methyl, # design for RRBS
                               block = rep(meta$patient, each = 2),
                               correlation = corfit$consensus.correlation)
corfit <- duplicateCorrelation(object = vwqw,
                               design = design_normal, # regular design
                               block = rep(meta$patient, each = 2))
Q2: Should the warning message "Block factor already encoded in the design matrix: setting intrablock correlation to zero." from duplicateCorrelation also be passed on by voomLmFit? Is the current behavior intentional?
Any insights or suggestions would be greatly appreciated. Thank you so much for your help in advance.
Dear Gordon Smyth,
Thank you so much for your detailed response and for updating the voomLmFit warning messages inherited from duplicateCorrelation. I greatly appreciate that you pointed out the flaw in my suggestion of duplicating the rows of the normal design matrix. I have been reflecting on your suggestion of estimating the intra-block correlation by fitting the normal design to the methylated count matrix. I was wondering whether there is any particular reason for using the methylated counts specifically, as opposed to the unmethylated counts? I assume there is no critical difference between the two.
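For reference, here is how I have been interpreting that suggestion (a sketch only; the column selection and the use of voomWithQualityWeights are my own assumptions, so please correct me if this is not what you meant):
# continuing with the objects from my first post (counts, meta)
me_counts <- counts[, grepl("_Me$", colnames(counts))]  # methylated columns only
design_normal <- model.matrix(~0 + group, meta)         # ordinary design, one row per library
v_me <- voomWithQualityWeights(counts = me_counts, design = design_normal)
corfit <- duplicateCorrelation(object = v_me, design = design_normal,
                               block = meta$patient)
corfit$consensus.correlation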
Furthermore, would it be reasonable to consider using M-values (or beta-values), since these metrics incorporate information from both the methylated and unmethylated counts? To explore this further, I simulated a methylation matrix by adding batch (patient) effects to M-values:
Then I re-calculated both the methylated (me_count) and unmethylated (un_count) counts as well as the M-values (m_values) and beta-values (beta_values):
Here is what the MDS plots look like for each matrix, which confirm the presence of the batch effect:
Finally, I calculated the intra-block correlation using duplicateCorrelation for each data type:
Considering that both the batch effect (batch_effect) and the M-values (M) were sampled from similar normal distributions (mean = 0, sd = 1), I guess the intra-block correlation coefficient should theoretically be 0.5 in this case, right? If that is the case, I believe using M-values or beta-values for duplicateCorrelation would be suitable for the downstream analysis.
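As a toy check of that expectation (a fresh minimal simulation with made-up names, independent of the matrices above): with a per-patient effect and residual noise that are both N(0, 1), the consensus correlation returned by duplicateCorrelation should come out near 1 / (1 + 1) = 0.5:
library(limma)
set.seed(1)
n_cpg <- 1000
patient <- factor(rep(1:6, each = 2))                    # 6 patients, 2 tissues each
grp <- factor(paste(rep(c("healthy", "sick"), each = 6),
                    rep(c("liver", "muscle"), times = 6), sep = "."))
des <- model.matrix(~0 + grp)                            # one row per library
mu <- matrix(rnorm(n_cpg * 4), n_cpg, 4)[, as.integer(grp)]      # group means
b  <- matrix(rnorm(n_cpg * 6), n_cpg, 6)[, as.integer(patient)]  # patient effect, sd = 1
e  <- matrix(rnorm(n_cpg * 12), n_cpg, 12)                       # residual noise, sd = 1
m_sim <- mu + b + e
duplicateCorrelation(m_sim, design = des, block = patient)$consensus.correlation
# should be close to 0.5, the theoretical intra-block correlation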
I hope I am not missing something fundamental here, and I truly appreciate your feedback. Once again, thank you so much for your continued help, Gordon Smyth.
It might be ok to estimate the intrablock correlation from the M-values. But then, if you do that, you may as well do the whole analysis in limma on the M-values.
Thanks a lot for your feedback, Gordon Smyth
If I may, one final question to follow up on your comment about a limma analysis using M-values: how can I incorporate the coverage information into the analysis? Would it be enough to pass the coverage matrix to arrayWeights(), followed by duplicateCorrelation() and lmFit()? Alternatively, should the coverage information be passed to the weights parameter of either duplicateCorrelation() or lmFit()? Below are 3 possible approaches I am considering:
Any of these approaches will be followed with:
Would you recommend one of these approaches for incorporating coverage most effectively, or is there a preferred strategy that better aligns with limma best practices?
No, none of those approaches would be appropriate ways to incorporate coverage.
I suggest treating the methylation counts like a two-color microarray:
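In rough outline, something along these lines (a sketch only; the offset of 2 and the particular downstream steps are illustrative rather than prescriptive):
# reusing the counts, meta and libraries from the earlier posts
Me <- counts[, grepl("_Me$", colnames(counts))]   # methylated counts
Un <- counts[, grepl("_Un$", colnames(counts))]   # unmethylated counts
M  <- log2(Me + 2) - log2(Un + 2)                 # log-ratio of Me to Un, like a two-color M-value
A  <- (log2(Me + 2) + log2(Un + 2)) / 2           # average log-count, like a two-color A-value (reflects coverage)
design_normal <- model.matrix(~0 + group, meta)   # one row per library
corfit <- duplicateCorrelation(M, design = design_normal, block = meta$patient)
fit <- lmFit(M, design = design_normal, block = meta$patient,
             correlation = corfit$consensus.correlation)
fit <- eBayes(fit, robust = TRUE)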
This limma approach will be somewhat more conservative than the edgeR analysis in our workflow, but perhaps rightly so.
This is such an elegant showcase of two-color microarray data analysis. Thank you so much for your great help and valuable feedback.