Question: edgeR and removeBatchEffect()
gravatar for Bogdan
3.1 years ago by
Palo Alto, CA, USA
Bogdan580 wrote:

Dear all, I would appreciate any comments on the R code below that :

-- uses edgeR to measure the differential expression between siCTL and siPROTEIN

-- uses removeBatchEffect() function to correct for batch effects.

The question would be : is the correct R code that corrects for batch effect first, and subsequently calls for differential expression ? Thank you ! I am also using a paired-design for our samples, as the subjects have been sequenced at different times. 

eset <- read.delim("", row.names="Symbol")

group <- factor(c("siCTL","siCTL","siPROTEIN","siPROTEIN"))
group <- relevel(group,ref="siCTL")

subject <- factor(c(1,2,1,2))
design <- model.matrix(~group+subject)

y <- DGEList(counts=eset,group=group)

keep <- rowSums(cpm(y) > 0.5) >= 4

y <- y[keep,,keep.lib.sizes=FALSE]
y <- calcNormFactors(y)

logCPM <- cpm(y,log=TRUE,prior.count=3)
plotMDS(logCPM, col=as.numeric(subject))

logCPM <- removeBatchEffect(logCPM, batch=subject)
plotMDS(logCPM, col=as.numeric(subject))
fit <- lmFit(logCPM, design)
fit <- eBayes(fit,trend=TRUE, robust=TRUE)

results <- topTable(fit, coef=2, adjust="fdr", number=Inf)

write.table(results, file="")

limma edger • 1.9k views
ADD COMMENTlink modified 3.1 years ago by Gordon Smyth39k • written 3.1 years ago by Bogdan580
Answer: edgeR and removeBatchEffect()
gravatar for Aaron Lun
3.1 years ago by
Aaron Lun25k
Cambridge, United Kingdom
Aaron Lun25k wrote:

Just include the batch factor in the design matrix (as you've already done) and use edgeR's GLM methods, this will automatically account for any additive batch effect:

y <- estimateDisp(y, design, robust=TRUE) # after calcNormFactors
fit <- glmFit(y, design)
res <- glmLRT(fit, coef=2) # or use glmQLFit/glmQLFTest

If you try to remove the batch effect manually, you'll be overstating the residual degrees of freedom when you go on to perform the linear modelling. You're also using log-CPMs rather than modelling the counts directly, which will be suboptimal as it doesn't account for the mean-variance relationship of the count data.

Also, your filter doesn't make a lot of sense, because genes that are silent in one condition and expressed in the other condition will be filtered out. This will remove genes with the strongest evidence for DE. Perhaps you should set:

keep <- rowSums(cpm(y) > 0.5) >= 2

Edit: Gordon just pointed it out to me: the residual d.f. thing isn't a problem here, as you've already blocked on the batch factor in the design matrix, so the residual d.f. calculation in lmFit and eBayes will be correct. But while it's not wrong, it is unnecessary, so there's no point in doing it. (Removing the batch effect manually would only make a difference if you did not include the batch factor in the design matrix - in which case, my comment about the residual d.f. applies. So don't do that either.)

ADD COMMENTlink modified 3.1 years ago • written 3.1 years ago by Aaron Lun25k
Answer: edgeR and removeBatchEffect()
gravatar for Gordon Smyth
3.1 years ago by
Gordon Smyth39k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth39k wrote:

Well, there are a few points that need to be clarified. First, your code is actually using the limma package for differential expression analysis rather than the edgeR package. You are only using edgeR for filtering and normalization (the cpm and calcNormFactors functions). Of course you could switch to edgeR instead if you want (see Aaron's answer), but your limma approach should also be fine.

Second, your experiment doesn't actually have a batch effect. Baseline differences between subjects are already factored into the paired design, so subjects are already expected to be different. The fact that they were sequenced at different times, therefore, does not constitute a batch effect.

There is no need to call removeBatchEffect(), and indeed the help page for removeBatchEffect says that this function should not be used in conjunction with a differential expression analysis. If you had passed the design matrix to removeBatchEffect:

logCPM <- removeBatchEffect(logCPM, batch=subject, design=design)

then removeBatchEffect would have, very sensibly, left logCPM unchanged and given you a warning.

Since you are using removeBatchEffect() to remove the effect of a factor that is already in your design matrix, I think the DE gene list from topTable() will be virtually unchanged when you omit the removeBatchEffect step.

ADD COMMENTlink modified 3.1 years ago • written 3.1 years ago by Gordon Smyth39k

Dear Aaron, and Gordon, thank you for your comments and help. I will get back to you with a more general piece of code for removing of the batch effects, and will ask for your comments, if you do not mind. many thanks !

ADD REPLYlink written 3.1 years ago by Bogdan580
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 198 users visited in the last hour