Question

edgeR and removeBatchEffect()

0

Entering edit mode

Bogdan ▴ 670

@bogdan-2367

Last seen 6 months ago

Palo Alto, CA, USA

Dear all, I would appreciate any comments on the R code below that :

-- uses edgeR to measure the differential expression between siCTL and siPROTEIN

-- uses removeBatchEffect() function to correct for batch effects.

The question would be : is the correct R code that corrects for batch effect first, and subsequently calls for differential expression ? Thank you ! I am also using a paired-design for our samples, as the subjects have been sequenced at different times.

eset <- read.delim("mm8.genes.no.adj", row.names="Symbol")

group <- factor(c("siCTL","siCTL","siPROTEIN","siPROTEIN"))
group <- relevel(group,ref="siCTL")

subject <- factor(c(1,2,1,2))
design <- model.matrix(~group+subject)

y <- DGEList(counts=eset,group=group)

keep <- rowSums(cpm(y) > 0.5) >= 4

y <- y[keep,,keep.lib.sizes=FALSE]
y <- calcNormFactors(y)

logCPM <- cpm(y,log=TRUE,prior.count=3)
plotMDS(logCPM, col=as.numeric(subject))

logCPM <- removeBatchEffect(logCPM, batch=subject)
plotMDS(logCPM, col=as.numeric(subject))

fit <- lmFit(logCPM, design)
fit <- eBayes(fit,trend=TRUE, robust=TRUE)

results <- topTable(fit, coef=2, adjust="fdr", number=Inf)

write.table(results, file="")

edger limma • 5.0k views

ADD COMMENT • link updated 7.5 years ago by Gordon Smyth 50k • written 7.5 years ago by Bogdan ▴ 670

score 2 · Answer 1 · 2016-10-01

Just include the batch factor in the design matrix (as you've already done) and use edgeR's GLM methods, this will automatically account for any additive batch effect:

y <- estimateDisp(y, design, robust=TRUE) # after calcNormFactors
fit <- glmFit(y, design)
res <- glmLRT(fit, coef=2) # or use glmQLFit/glmQLFTest

If you try to remove the batch effect manually, you'll be overstating the residual degrees of freedom when you go on to perform the linear modelling. You're also using log-CPMs rather than modelling the counts directly, which will be suboptimal as it doesn't account for the mean-variance relationship of the count data.

Also, your filter doesn't make a lot of sense, because genes that are silent in one condition and expressed in the other condition will be filtered out. This will remove genes with the strongest evidence for DE. Perhaps you should set:

keep <- rowSums(cpm(y) > 0.5) >= 2

Edit: Gordon just pointed it out to me: the residual d.f. thing isn't a problem here, as you've already blocked on the batch factor in the design matrix, so the residual d.f. calculation in lmFit and eBayes will be correct. But while it's not wrong, it is unnecessary, so there's no point in doing it. (Removing the batch effect manually would only make a difference if you did not include the batch factor in the design matrix - in which case, my comment about the residual d.f. applies. So don't do that either.)

score 1 · Answer 2 · 2016-10-02

Well, there are a few points that need to be clarified. First, your code is actually using the limma package for differential expression analysis rather than the edgeR package. You are only using edgeR for filtering and normalization (the cpm and calcNormFactors functions). Of course you could switch to edgeR instead if you want (see Aaron's answer), but your limma approach should also be fine.

Second, your experiment doesn't actually have a batch effect. Baseline differences between subjects are already factored into the paired design, so subjects are already expected to be different. The fact that they were sequenced at different times, therefore, does not constitute a batch effect.

There is no need to call removeBatchEffect(), and indeed the help page for removeBatchEffect says that this function should not be used in conjunction with a differential expression analysis. If you had passed the design matrix to removeBatchEffect:

logCPM <- removeBatchEffect(logCPM, batch=subject, design=design)

then removeBatchEffect would have, very sensibly, left logCPM unchanged and given you a warning.

Since you are using removeBatchEffect() to remove the effect of a factor that is already in your design matrix, I think the DE gene list from topTable() will be virtually unchanged when you omit the removeBatchEffect step.