Case Scenario

Question

Re-numbering blocking IDs on paired samples with duplicateCorrelation or design covariate

1

Entering edit mode

altintas.ali • 0

@altintasali-11054

Last seen 8 days ago

Denmark

Hi all,

I have a paired setup where there are healthy and sick individuals. They have received some treatment in a controlled fashion and I am interested in differential gene expression across treatment and disease groups. I was wondering whether I should use edgeR with blocking patient IDs in the design or use voom with duplicateCorrelation. Also, shall I re-number the patient IDs or use them as they are. Please follow below for details:

Case Scenario

# Toy example
meta <- data.frame(Patient = as.factor(rep(1:7, each =3)),
                   Disease = c(rep("Sick", 12), rep("Healthy", 9)),
                   Treatment = rep(c("Placebo","Drug1","Drug2"), 7))
meta$group <- paste(meta$Disease, meta$Treatment, sep = ".")

#    Patient Disease Treatment           group
# 1        1    Sick   Placebo    Sick.Placebo
# 2        1    Sick     Drug1      Sick.Drug1
# 3        1    Sick     Drug2      Sick.Drug2
# 4        2    Sick   Placebo    Sick.Placebo
# 5        2    Sick     Drug1      Sick.Drug1
# 6        2    Sick     Drug2      Sick.Drug2
# 7        3    Sick   Placebo    Sick.Placebo
# 8        3    Sick     Drug1      Sick.Drug1
# 9        3    Sick     Drug2      Sick.Drug2
# 10       4    Sick   Placebo    Sick.Placebo
# 11       4    Sick     Drug1      Sick.Drug1
# 12       4    Sick     Drug2      Sick.Drug2
# 13       5 Healthy   Placebo Healthy.Placebo
# 14       5 Healthy     Drug1   Healthy.Drug1
# 15       5 Healthy     Drug2   Healthy.Drug2
# 16       6 Healthy   Placebo Healthy.Placebo
# 17       6 Healthy     Drug1   Healthy.Drug1
# 18       6 Healthy     Drug2   Healthy.Drug2
# 19       7 Healthy   Placebo Healthy.Placebo
# 20       7 Healthy     Drug1   Healthy.Drug1
# 21       7 Healthy     Drug2   Healthy.Drug2

Approach 1

When Patient is used as a blocking factor in the design, my design matrix is not full rank as expected.

design <- model.matrix(~ 0 + group + Patient, meta)
is.fullrank(design)
# FALSE
nonEstimable(design)
# Patient7

Question 1: Do you think if it is a good approach to drop the non-estimable coefficient in the design matrix as shown below, then proceed with the downstream analysis?

i <- grep(nonEstimable(design), colnames(design))
design2 <- design[,-i]
is.fullrank(design2)
# TRUE

Question 2: A similar case has been shown on edgeR manual's section 3.5 as "Comparisons both between and within subjects" and the Patient IDs are re-numbered within each Disease group. When I try that approach, the matrix is full rank. Do you think if patient ID re-numbering is a better approach than dropping the non-estimable parameter?

meta2 <- data.frame(Patient = as.factor(c(gl(4,3),gl(3,3))) ,
                    Disease = c(rep("Sick", 12), rep("Healthy", 9)),
                    Treatment = rep(c("Placebo","Drug1","Drug2"), 7))
meta2$group <- paste(meta2$Disease, meta2$Treatment, sep = ".")

#    Patient Disease Treatment           group
# 1        1    Sick   Placebo    Sick.Placebo
# 2        1    Sick     Drug1      Sick.Drug1
# 3        1    Sick     Drug2      Sick.Drug2
# 4        2    Sick   Placebo    Sick.Placebo
# 5        2    Sick     Drug1      Sick.Drug1
# 6        2    Sick     Drug2      Sick.Drug2
# 7        3    Sick   Placebo    Sick.Placebo
# 8        3    Sick     Drug1      Sick.Drug1
# 9        3    Sick     Drug2      Sick.Drug2
# 10       4    Sick   Placebo    Sick.Placebo
# 11       4    Sick     Drug1      Sick.Drug1
# 12       4    Sick     Drug2      Sick.Drug2
# 13       1 Healthy   Placebo Healthy.Placebo
# 14       1 Healthy     Drug1   Healthy.Drug1
# 15       1 Healthy     Drug2   Healthy.Drug2
# 16       2 Healthy   Placebo Healthy.Placebo
# 17       2 Healthy     Drug1   Healthy.Drug1
# 18       2 Healthy     Drug2   Healthy.Drug2
# 19       3 Healthy   Placebo Healthy.Placebo
# 20       3 Healthy     Drug1   Healthy.Drug1
# 21       3 Healthy     Drug2   Healthy.Drug2

design2 <- model.matrix(~ 0 + group + Patient, meta2)
is.fullrank(design2)
# TRUE

Question 3: I have parameterized the Disease and Treatment into group as it is easier to interpret while formulating the contrasts. Do you think if I should abandon this approach and use the design proposed on edgeR manual, section 3.5?

design3 <- model.matrix(~Disease+Disease:Patient+Disease:Treatment, meta2)
is.fullrank(design3)
# FALSE
nonEstimable(design3)
# "DiseaseHealthy:Patient4" # Note that it is all zero since there is no Patient4 in Sick patients
i <- grep(nonEstimable(design3), colnames(design3))
design3 <- design3[,-i]
is.fullrank(design3)
# TRUE

Approach 2

As an alternative, I was planning to use duplicateCorrelation:

# Matrix size
nsamples <- 21
ngenes <- 1e4

# counts matrix
set.seed(1)
counts <- matrix(rnbinom(n = ngenes * nsamples, mu=10, size=10), ncol = nsamples)
rownames(counts) <- paste0("Gene", 1:ngenes)
colnames(counts) <- rownames(meta)

# DGEList
y <- DGEList(counts = counts, samples = meta)

# design
design4 <- model.matrix(~ 0 + group, meta)

# voom & duplicateCorrelation
v <- voom(y, design4)
corfit <- duplicateCorrelation(v, design4, block = meta$Patient)
v <- voom(y, design4, block = meta$Patient, correlation = corfit$consensus)
corfit <- duplicateCorrelation(v, design4, block = meta$Patient)
fit <- lmFit(v, design4, block = meta$Patient, correlation = corfit$consensus)

Question 4: Do you think if this approach is better than the other one in this case? In some cases, I receive a warning message (shown below) and Gordon Smyth mentioned that it is ignorable (for example this post). (Please note that I was not able to reproduce the warning message below in this toy example)

Warning message:
In glmgam.fit(dx, dy, coef.start = start, tol = tol, maxit = maxit,  :
  Too much damping - convergence tolerance not achievable

Question 5: I have never seen a use of duplicateCorrelation with re-numbered patient IDs. However, the warning message above seems to disappear when I use the re-numbered patient IDs (as shown in meta2). Is it a good approach to use the re-numbered patient IDs within each disease with duplicateCorrelation?

All the best!

edger limma duplicateCorrelation batch effect • 1.4k views

ADD COMMENT • link updated 5.5 years ago by Aaron Lun ★ 28k • written 5.5 years ago by altintas.ali • 0

score 3 · Accepted Answer · 2019-06-20

tl; dr If you want to directly compare expression between healthy and sick patients, use voom.

Question 1: I would prefer to use the ~ 0 + Patient + Group formulation and to remove the unestimable Healthy.Placebo coefficient. (I'm going to assume you releveled group so that the sick placebo is the first level.) This means that the last 4 coefficients represent log-fold changes of each drug treatment over the placebo within the sick/healthy class, which are fairly easy to interpret. Your current formulation has misleading column names - for example, in your design, Sick.Placebo is not the average of the healthy placebo samples across all patients, but instead represents the expected log-expression in the first sample!

(Note that your design would be fine for comparing between groups within all sick patients, or within all healthy patients, as we've assumed the patient effect is additive. However, the column names may tempt you to set up contrasts between Sick and Healthy coefficients, which would be wrong. By comparison, the coefficients in my formulation already represent log-fold changes relative to placebo within each level of Disease, so these are safe to compare across levels.)

Question 2: You haven't set it up right. Your design assumes that the first sick patient is the same as the first healthy patient. If you want to follow Section 3.5, you need the appropriatePatient:Disease interaction. But this is a pain, so just use the design I pointed out above.

Question 3: No.

Question 4: If you want to directly compare the average expression of, say, Sick.placebo to Healthy.placebo, then you must use voom. If you only care about comparing the effects of drugs between disease levels (e.g., is the log-fold change upon drug 1 treatment in sick patients the same as that in healthy patients?), then you can continue using edgeR.

Question 5: Don't renumber them, see my answer to question 2.