Questions

Question

Zero intra-block correlation in RRBS analysis using voomLmFit with block

0

Entering edit mode

altintas.ali • 0

@altintasali-11054

Last seen 17 days ago

Denmark

Dear all,

This is just a follow-up question from an old post related to RRBS analysis using voomLmFit with block.

I am facing an issue where the intra-block correlation is consistently set to zero, meaning that no need to account for blocking, when running voomLmFit for RRBS analysis with a block factor (i.e., patient).

# Experimental design
meta <- data.frame(library = as.factor(1:12),
                   patient = as.factor(rep(1:6, each = 2)),
                   disease = rep(c("healthy", "sick"), each = 6),
                   tissue = rep(c("liver","muscle"), times = 6))
meta$group <- paste(meta$disease, meta$tissue, sep = ".")

# Toy data
set.seed(123)
counts <- matrix(data = round(runif(n = 24000, min = 0, max = 100)),
                 ncol = nrow(meta)*2)
cols <- expand.grid(c("Me","Un"), meta$library)
colnames(counts) <- paste(cols[,2], cols[,1], sep = "_")
rownames(counts) <- paste("CpG", 1:nrow(counts), sep = "_")

# voomLmFit
library(edgeR)
design <- modelMatrixMeth(~0 + group, meta)
colnames(design) <- gsub("group", "", colnames(design))

v <- voomLmFit(counts = counts, 
               design = design, 
               block = rep(meta$patient, each = 2),
               sample.weights = TRUE)

# First sample weights (min/max) 0.8816726/1.2959486
# First intra-block correlation  0
# Final sample weights (min/max) 0.8703724/1.3442574
# Final intra-block correlation  0

At first, I suspected that the simulated data in this example might have led to zero intra-block correlation, but I observed the same behavior with real-life data. Running voomWithQualityWeights and duplicateCorrelation manually confirms this issue, along with the following warning message:

Block factor already encoded in the design matrix: setting intrablock correlation to zero.

Replicating the voomLmFit by running voomWithQualityWeights and duplicateCorrelation twice confirmed this behavior:

vwqw <- voomWithQualityWeights(counts = counts, 
                               design = design)
corfit <- duplicateCorrelation(object = vwqw, 
                               design = design, 
                               block = rep(meta$patient, each = 2)) 

# Warning message:
# In duplicateCorrelation(object = vwqw, design = design, block = rep(meta$patient,  :
#   Block factor already encoded in the design matrix: setting intrablock correlation to zero.

vwqw <- voomWithQualityWeights(counts = counts, 
                               design = design, 
                               block = rep(meta$patient, each = 2), 
                               correlation = corfit$consensus.correlation)
corfit <- duplicateCorrelation(object = vwqw, 
                               design = design, 
                               block = rep(meta$patient, each = 2)) 

# Warning message:
# In duplicateCorrelation(object = vwqw, design = design, block = rep(meta$patient,  :
#   Block factor already encoded in the design matrix: setting intrablock correlation to zero.

In fact, the Sample (library) is nested within the block (patient) inside the methylation design matrix due to the way modelMatrixMeth is created. Therefore, the warning message makes sense.

head(design)

#   Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7 Sample8 Sample9 Sample10 Sample11 Sample12 healthy.liver healthy.muscle
# 1       1       0       0       0       0       0       0       0       0        0        0        0             1              0
# 2       1       0       0       0       0       0       0       0       0        0        0        0             0              0
# 3       0       1       0       0       0       0       0       0       0        0        0        0             0              1
# 4       0       1       0       0       0       0       0       0       0        0        0        0             0              0
# 5       0       0       1       0       0       0       0       0       0        0        0        0             1              0
# 6       0       0       1       0       0       0       0       0       0        0        0        0             0              0
#   sick.liver sick.muscle
# 1          0           0
# 2          0           0
# 3          0           0
# 4          0           0
# 5          0           0
# 6          0           0

Questions

Q1: Is there a correct approach to account for block (i.e., patient) in voomLmFit, given the warning about the block factor being encoded in the design matrix? For example, would it make sense to use design from model.matrix instead of modelMatrixMeth while running the duplicateCorrelation?

design_normal <- model.matrix(~0 + group, meta)
design_normal <- design_normal[rep(seq_len(nrow(design_normal)), each = 2), ] #replicate each row twice
design_methyl <- modelMatrixMeth(~0 + group, meta)

vwqw <- voomWithQualityWeights(counts = counts, 
                               design = design_methyl, # design for RRBS
                               block = rep(meta$patient, each = 2), 
                               correlation = corfit$consensus.correlation)
corfit <- duplicateCorrelation(object = vwqw, 
                               design = design_normal, # regular design
                               block = rep(meta$patient, each = 2))

Q2: Should the warning message "Block factor already encoded in the design matrix: setting intrablock correlation to zero." from duplicateCorrelation also be passed to voomLmFit? Is the current behavior intentional?

Any insights or suggestions would be greatly appreciated. Thank you so much for your help in advance.

limma duplicateCorrelation edgeR modelMatrixMeth voomLmFit • 1.1k views

ADD COMMENT • link 10 weeks ago • updated 9 weeks ago altintas.ali • 0

score 2 · Accepted Answer · 2024-10-19

2

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 2 hours ago

WEHI, Melbourne, Australia

Thanks for pointing out this issue. Yes, the fact that limma is using sample effects to convert the count data into proportion data means that it is not possible to estimate a block correlation at the same time. My previous advice to you about this was too simple.

No, you can't just double the rows of the "normal" design matrix, as in your code suggestion. The fact that the methylated and unmethylated counts for the same sample are negatively correlated will overwhelm the intrablock correlation. The code you give will yield a correlation that is too low and possibly negative.

On the other hand, I think you could estimate the intrablock correlation by fitting the "normal" design matrix to the methylated counts alone, i.e., just subset the counts matrix to every second column, and then feed the resulting correlation into the full methylation analysis with both methylated and unmethylated columns. This is essentially a statistical research idea though. I think it should work but I have not tried it or done any checking.

I am have been suppressing warnings from duplicateCorrelation() to voomLmFit() because I didn't want the warnings to appear more than once as voomLmFit() iterates. I have now made an update to voomLmFit() on the developmental version of edgeR so that the duplicateCorrelation() warning will be passed on by voomLmFit(), but just once.

ADD COMMENT • link 10 weeks ago Gordon Smyth 52k

0

Entering edit mode

Dear Gordon Smyth,

Thank you so much for your detailed response and for updating voomLmFit warning messages inherited from duplicateCorrelation.

I greatly appreciate that you pointed out the flaw on my suggestion of duplicating the rows of normal design matrix. I have been reflecting on your suggestion about estimating the intra-block correlation by fitting the normal design to "methylated" count matrix. I was wondering if there is any particular reason for using the methylated counts specifically, as opposed to the unmethylated counts? I assume there is no critical difference between the two.

Furthermore, would it be reasonable to consider using M-values (or beta-values) since these metrics incorporate information from both methylated and unmethylated counts? To explore further, I tried to simulate this by creating a methylation matrix by adding batch (patient) effects to M-values:

library(edgeR)
# Experimental design ----
meta <- data.frame(library = as.factor(1:12),
                   patient = as.factor(rep(1:6, each = 2)),
                   disease = rep(c("healthy", "sick"), each = 6),
                   tissue = rep(c("liver","muscle"), times = 6))
meta$group <- paste(meta$disease, meta$tissue, sep = ".")

# Toy data using M-values with patient batch effect ----
set.seed(123)
M <- matrix(data = rnorm(n = 12000), ncol = nrow(meta))
batch_effect <- matrix(rnorm(n = 6000), ncol = 6)
M <- M + batch_effect[, as.integer(meta$patient)]
plotMDS(M, labels = meta$patient) # confirms batch effect

coverage <- matrix(data = round(runif(n = 12000, min = 1, max = 100)), ncol = nrow(meta))
m2beta <- function(m){(2^m)/(2^m + 1)}
B <- m2beta(M)
me_counts <- round(B * coverage)
un_counts <- round((1-B) * coverage)

meID <- seq(1, nrow(meta)*2, by = 2)
unID <- seq(2, nrow(meta)*2, by = 2)
methyl_matrix <- matrix(NA, nrow = nrow(M), ncol = ncol(M)*2)
methyl_matrix[, meID] <- me_counts
methyl_matrix[, unID] <- un_counts

counts <- methyl_matrix
cols <- expand.grid(c("Me","Un"), meta$library)
colnames(counts) <- paste(cols[,2], cols[,1], sep = "_")
rownames(counts) <- paste("CpG", 1:nrow(counts), sep = "_")

Then I re-calculated both methylated (me_count) and unmethylated (un_count) counts as well as M (m_values) and beta (beta_values):

# Calculate input matrices from toy data ---- 
me_counts <- counts[,meID]
un_counts <- counts[,unID]
prior <- 2
m_values <- log2(me_counts + prior) - log2(un_counts + prior)
beta_values <- me_counts/(me_counts + un_counts)

Here are what the MDS plots look like for each matrix which confirm the presence of batch effect:

## MDS plot using different matrices ----
par(mfrow = c(2,2))
plotMDS(me_counts, labels = meta$patient, main = "Methylated counts")
plotMDS(un_counts, labels = meta$patient, main = "Unmethylated counts")
plotMDS(m_values, labels = meta$patient, main = "M-values")
plotMDS(beta_values, labels = meta$patient, main = expression(bold(beta*"-values")))

MDSplots

Finally, I calculated the intra-block correlation using duplicateCorrelation for each data type:

# duplicateCorrelation ----
design <- model.matrix(~ 0 + group, meta)
colnames(design) <- gsub("group", "", colnames(design))

## methylated counts ----
corfit <- duplicateCorrelation(object = me_counts, 
                               design = design, 
                               block = meta$patient) 
corfit$consensus.correlation
#> [1] 0.1300098

## unmethylated counts ----
corfit <- duplicateCorrelation(object = un_counts, 
                               design = design, 
                               block = meta$patient) 
corfit$consensus.correlation
#> [1] 0.1434293

## m-values ---- 
corfit <- duplicateCorrelation(object = m_values, 
                               design = design, 
                               block = meta$patient) 
corfit$consensus.correlation
#> [1] 0.4814286

## beta-values ---- 
corfit <- duplicateCorrelation(object = beta_values, 
                               design = design, 
                               block = meta$patient) 
corfit$consensus.correlation
#> [1] 0.4914304

Considering both batch effect (batch_effect) and M-values (M) were sampled from similar normal distributions (mean = 0, sd = 1), I guess the intra-block correlation coefficient should be theoretically 0.5 in this case, right? If that is the case, I believe using M-values or beta-values for duplicateCorrelation would be suitable for the downstream analysis. I hope I am not missing something fundamental here and I truly appreciate your feedback.

Once again, thank you so much for your continued help, Gordon Smyth

ADD REPLY • link 9 weeks ago altintas.ali • 0

1

Entering edit mode

It might be ok to estimate the intrablock correlation from the M-values. But then, if you do that, you may as well do the whole analysis in limma on the M-values.

ADD REPLY • link 9 weeks ago Gordon Smyth 52k

0

Entering edit mode

Thanks a lot for your feedback, Gordon Smyth

If I may, one final question to follow up on your comment limma analysis using M-values: how can I incorporate the coverage information to the analysis? Would it be enough to pass coverage matrix to arrayWeights() followed by duplicateCorrelation() and lmFit()? Alternatively, should the coverage information be passed to weights parameter of either duplicateCorrelation or lmFit()? Below are 3 possible approaches I am considering:

### Approach-1: arrayWeights() uses coverage ---- 
aw <- arrayWeights(object = coverage, design = design) 
corfit <- duplicateCorrelation(object = m_values, design = design, block = meta$patient, weights = aw)
fit <- lmFit(object = m_values, design = design, weights = aw, 
             block = meta$patient, correlation = corfit$consensus.correlation)

### Approach-2: duplicateCorrelations()'s weights = coverage ----
corfit <- duplicateCorrelation(object = m_values, design = design, block = meta$patient, weights = coverage)
fit <- lmFit(object = m_values, design = design, 
             block = meta$patient, correlation = corfit$consensus.correlation)  

### Approach-3: lmFit()'s weights = coverage ----
corfit <- duplicateCorrelation(object = m_values, design = design, block = meta$patient)
fit <- lmFit(object = m_values, design = design, weights = coverage, 
              block = meta$patient, correlation = corfit$consensus.correlation)

Any of these approaches will be followed with:

eb <- eBayes(fit = fit, trend = TRUE, robust = TRUE)

Would you recommend one of these approaches for incorporating coverage most effectively, or is there a preferred strategy that better aligns with limma best practices?

ADD REPLY • link 9 weeks ago altintas.ali • 0

1

Entering edit mode

No, none of those approaches would be appropriate ways to incorporate coverage.

I suggest treating the methylation counts like a two-color microarray:

M <- log2(me_counts + 0.5) - log2(un_counts + 0.5)
A <- log2(me_counts + un_counts + 1)
MA <- new("MAList",list(M=M,A=A))
fit <- lmFit(MA, design=design)
fit <- eBayes(fit, trend=TRUE)

This limma approach will be somewhat more conservative than the edgeR analysis in our workflow, but perhaps rightly so.

ADD REPLY • link 9 weeks ago Gordon Smyth 52k

0

Entering edit mode

This is such an elegant showcase of two-color microarray data analysis. Thank you so much for your great help and valuable feedback.

ADD REPLY • link 9 weeks ago altintas.ali • 0