Hi all,
I have a paired setup where there are healthy and sick individuals. They have received some treatment in a controlled fashion and I am interested in differential gene expression across treatment and disease groups. I was wondering whether I should use edgeR
with blocking patient IDs in the design or use voom
with duplicateCorrelation
. Also, shall I re-number the patient IDs or use them as they are. Please follow below for details:
Case Scenario
# Toy example
meta <- data.frame(Patient = as.factor(rep(1:7, each =3)),
Disease = c(rep("Sick", 12), rep("Healthy", 9)),
Treatment = rep(c("Placebo","Drug1","Drug2"), 7))
meta$group <- paste(meta$Disease, meta$Treatment, sep = ".")
# Patient Disease Treatment group
# 1 1 Sick Placebo Sick.Placebo
# 2 1 Sick Drug1 Sick.Drug1
# 3 1 Sick Drug2 Sick.Drug2
# 4 2 Sick Placebo Sick.Placebo
# 5 2 Sick Drug1 Sick.Drug1
# 6 2 Sick Drug2 Sick.Drug2
# 7 3 Sick Placebo Sick.Placebo
# 8 3 Sick Drug1 Sick.Drug1
# 9 3 Sick Drug2 Sick.Drug2
# 10 4 Sick Placebo Sick.Placebo
# 11 4 Sick Drug1 Sick.Drug1
# 12 4 Sick Drug2 Sick.Drug2
# 13 5 Healthy Placebo Healthy.Placebo
# 14 5 Healthy Drug1 Healthy.Drug1
# 15 5 Healthy Drug2 Healthy.Drug2
# 16 6 Healthy Placebo Healthy.Placebo
# 17 6 Healthy Drug1 Healthy.Drug1
# 18 6 Healthy Drug2 Healthy.Drug2
# 19 7 Healthy Placebo Healthy.Placebo
# 20 7 Healthy Drug1 Healthy.Drug1
# 21 7 Healthy Drug2 Healthy.Drug2
Approach 1
When Patient
is used as a blocking factor in the design, my design matrix is not full rank as expected.
design <- model.matrix(~ 0 + group + Patient, meta)
is.fullrank(design)
# FALSE
nonEstimable(design)
# Patient7
Question 1: Do you think if it is a good approach to drop the non-estimable coefficient in the design matrix as shown below, then proceed with the downstream analysis?
i <- grep(nonEstimable(design), colnames(design))
design2 <- design[,-i]
is.fullrank(design2)
# TRUE
Question 2: A similar case has been shown on edgeR manual's section 3.5 as "Comparisons both between and within subjects" and the Patient
IDs are re-numbered within each Disease
group. When I try that approach, the matrix is full rank. Do you think if patient ID re-numbering is a better approach than dropping the non-estimable parameter?
meta2 <- data.frame(Patient = as.factor(c(gl(4,3),gl(3,3))) ,
Disease = c(rep("Sick", 12), rep("Healthy", 9)),
Treatment = rep(c("Placebo","Drug1","Drug2"), 7))
meta2$group <- paste(meta2$Disease, meta2$Treatment, sep = ".")
# Patient Disease Treatment group
# 1 1 Sick Placebo Sick.Placebo
# 2 1 Sick Drug1 Sick.Drug1
# 3 1 Sick Drug2 Sick.Drug2
# 4 2 Sick Placebo Sick.Placebo
# 5 2 Sick Drug1 Sick.Drug1
# 6 2 Sick Drug2 Sick.Drug2
# 7 3 Sick Placebo Sick.Placebo
# 8 3 Sick Drug1 Sick.Drug1
# 9 3 Sick Drug2 Sick.Drug2
# 10 4 Sick Placebo Sick.Placebo
# 11 4 Sick Drug1 Sick.Drug1
# 12 4 Sick Drug2 Sick.Drug2
# 13 1 Healthy Placebo Healthy.Placebo
# 14 1 Healthy Drug1 Healthy.Drug1
# 15 1 Healthy Drug2 Healthy.Drug2
# 16 2 Healthy Placebo Healthy.Placebo
# 17 2 Healthy Drug1 Healthy.Drug1
# 18 2 Healthy Drug2 Healthy.Drug2
# 19 3 Healthy Placebo Healthy.Placebo
# 20 3 Healthy Drug1 Healthy.Drug1
# 21 3 Healthy Drug2 Healthy.Drug2
design2 <- model.matrix(~ 0 + group + Patient, meta2)
is.fullrank(design2)
# TRUE
Question 3: I have parameterized the Disease
and Treatment
into group
as it is easier to interpret while formulating the contrasts. Do you think if I should abandon this approach and use the design proposed on edgeR manual, section 3.5?
design3 <- model.matrix(~Disease+Disease:Patient+Disease:Treatment, meta2)
is.fullrank(design3)
# FALSE
nonEstimable(design3)
# "DiseaseHealthy:Patient4" # Note that it is all zero since there is no Patient4 in Sick patients
i <- grep(nonEstimable(design3), colnames(design3))
design3 <- design3[,-i]
is.fullrank(design3)
# TRUE
Approach 2
As an alternative, I was planning to use duplicateCorrelation
:
# Matrix size
nsamples <- 21
ngenes <- 1e4
# counts matrix
set.seed(1)
counts <- matrix(rnbinom(n = ngenes * nsamples, mu=10, size=10), ncol = nsamples)
rownames(counts) <- paste0("Gene", 1:ngenes)
colnames(counts) <- rownames(meta)
# DGEList
y <- DGEList(counts = counts, samples = meta)
# design
design4 <- model.matrix(~ 0 + group, meta)
# voom & duplicateCorrelation
v <- voom(y, design4)
corfit <- duplicateCorrelation(v, design4, block = meta$Patient)
v <- voom(y, design4, block = meta$Patient, correlation = corfit$consensus)
corfit <- duplicateCorrelation(v, design4, block = meta$Patient)
fit <- lmFit(v, design4, block = meta$Patient, correlation = corfit$consensus)
Question 4: Do you think if this approach is better than the other one in this case? In some cases, I receive a warning message (shown below) and Gordon Smyth mentioned that it is ignorable (for example this post). (Please note that I was not able to reproduce the warning message below in this toy example)
Warning message:
In glmgam.fit(dx, dy, coef.start = start, tol = tol, maxit = maxit, :
Too much damping - convergence tolerance not achievable
Question 5: I have never seen a use of duplicateCorrelation
with re-numbered patient IDs. However, the warning message above seems to disappear when I use the re-numbered patient IDs (as shown in meta2
). Is it a good approach to use the re-numbered patient IDs within each disease with duplicateCorrelation
?
All the best!
Dear Aaron,
Thank you very much for your detailed answers. Your solution for Question 1 (
~ 0 + Patient + Group
) is very neat. Just for me to understand it better: Let's assume that we have the same individuals sampled before and after they are sick (for some weird reason):Would that be ok to use the design
0 + Group + Patient
in this case as Patient 1 will be both in Disease and Treatment group (considering that Patient 4 is notrelevel
ed asref
erence)?Yes, if you have samples from both sick and healthy versions of the same patient, treated with the same series of drugs, that would be fine. Each
Group
coefficient now represents the expected log-expression of that disease/drug combination within patient 1. (We're assuming that the patient effect is additive, so these coefficients implicitly use information across all patients when they are contrasted to each other, even though in absolute terms they only represent expression in patient 1.) As you've noted, this would be a pretty uncommon situation, though.