Question

model matrix for edgeR paired-design with batch effect

0

Entering edit mode

NGS-seeker • 0

@ngs-seeker-17938

Last seen 6.4 years ago

Hi I have a quick question about a paired samples and batch effect correction. Especially I am seeking a solution for edgeR or limma/voom.

A few days ago, I fount the following post.

correction for batch effects and paired-design of RNA-seq samples

In his case, subject and batch are identical, so "There is no need to do any batch correction. The baseline differences between subjects, including any batch effect, is already accounted for in the paired analysis."

group <- factor(c("wt","wt","wt","ko","ko","ko"))
subject <- as.factor(c(1,2,3,1,2,3))
batch <- as.factor(c(1,2,3,1,2,3))

But how about the following example? In this case, a given subject's wt/ko samples are from different batches.

group <- factor(c("wt","wt","wt","ko","ko","ko"))
subject <- as.factor(c(1,2,3,1,2,3))
batch <- as.factor(c(1,1,1,2,2,1))

What is the best model design to correct batch effect and to predict DEG between WT and KO groups? Could you suggest the model fomula?

edgeR paired samples batch effect • 2.5k views

ADD COMMENT • link updated 6.4 years ago by Aaron Lun ★ 28k • written 6.4 years ago by NGS-seeker • 0

score 1 · Answer 1 · 2018-10-25

1

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 5 hours ago

The city by the bay

This design is quite poor, it's one sample away from being broken (batch is almost confounded with group). As it is, you can squeeze out 1 residual degree of freedom:

design <- model.matrix(~group + subject + batch)

... which should allow you to compare between WT and KO. This will not give you a lot of power - the dispersion will not be stably estimated, and a lot of information will be spent estimating the subject/batch blocking factors rather than being used to compare between genotypes.

Indeed, a careful analysis of the design indicates that the third WT sample is effectively useless; it will not contribute to dispersion estimation or DE testing, and will only be used by edgeR when computing the average. The two samples in batch 2 will mostly be uninvolved in DE testing, meaning that most of your analysis will rely on a 2 WT vs 1 KO comparison - not great.

ADD COMMENT • link 6.4 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thanks, Aaron.

I have another question about the order of coefficients in your formula. You placed the "batch" term in the last. Isn't it testing sig. differences between batches, not between groups. I found your prev. comment here (https://support.bioconductor.org/p/96627/)?

I want to compare two groups (KO -vs- WT) of paired/matched samples after correcting batch effect. How about (~batch + subject + group)? Does this formula work?

BTW, my example was just a toy example, not a real exp design.

ADD REPLY • link 6.4 years ago NGS-seeker • 0

0

Entering edit mode

Question 1: If you were to use the default coef=ncol(design) in glmLRT or glmQLFTest, then yes, you would be testing for DE between batches. But this is easy to change; just set coef or contrasts appropriately to perform the comparison you actually want to do. You shouldn't rely on the default coef anyway, it only works in limited scenarios.

Question 2: See my answer to question 1. Design matrix formulation choices like the order of factors in an additive formula (and whether or not to use an intercept) don't affect the final result as long as you set the contrast correctly, via coef or contrasts. This should be straightforward to do if you really understand what your design matrix means.

ADD REPLY • link 6.4 years ago Aaron Lun ★ 28k

0

Entering edit mode

what if the design was:

group <- factor(c("wt","wt","wt","wt","ko","ko","ko","ko"))
subject <- as.factor(c(1,2,3,4,1,2,3,4))
batch <- as.factor(c(1,1,2,2,1,1,2,2))

ie: one set of individuals sequenced day 1, different set of individuals sequenced day 2

would the design still be: design <- model.matrix(~group + subject + batch)

ADD REPLY • link 5.6 years ago agm • 0