edgeR and removeBatchEffect()
2
0
Entering edit mode
Bogdan ▴ 640
@bogdan-2367
Last seen 5 weeks ago
Palo Alto, CA, USA

Dear all, I would appreciate any comments on the R code below that :

-- uses edgeR to measure the differential expression between siCTL and siPROTEIN

-- uses removeBatchEffect() function to correct for batch effects.

The question would be : is the correct R code that corrects for batch effect first, and subsequently calls for differential expression ? Thank you ! I am also using a paired-design for our samples, as the subjects have been sequenced at different times. 


eset <- read.delim("mm8.genes.no.adj", row.names="Symbol")

group <- factor(c("siCTL","siCTL","siPROTEIN","siPROTEIN"))
group <- relevel(group,ref="siCTL")

subject <- factor(c(1,2,1,2))
design <- model.matrix(~group+subject)

y <- DGEList(counts=eset,group=group)

keep <- rowSums(cpm(y) > 0.5) >= 4

y <- y[keep,,keep.lib.sizes=FALSE]
y <- calcNormFactors(y)

logCPM <- cpm(y,log=TRUE,prior.count=3)
plotMDS(logCPM, col=as.numeric(subject))

logCPM <- removeBatchEffect(logCPM, batch=subject)
plotMDS(logCPM, col=as.numeric(subject))
 
fit <- lmFit(logCPM, design)
fit <- eBayes(fit,trend=TRUE, robust=TRUE)

results <- topTable(fit, coef=2, adjust="fdr", number=Inf)

write.table(results, file="")

edger limma • 3.5k views
ADD COMMENT
2
Entering edit mode
Aaron Lun ★ 27k
@alun
Last seen 9 hours ago
The city by the bay

Just include the batch factor in the design matrix (as you've already done) and use edgeR's GLM methods, this will automatically account for any additive batch effect:

y <- estimateDisp(y, design, robust=TRUE) # after calcNormFactors
fit <- glmFit(y, design)
res <- glmLRT(fit, coef=2) # or use glmQLFit/glmQLFTest

If you try to remove the batch effect manually, you'll be overstating the residual degrees of freedom when you go on to perform the linear modelling. You're also using log-CPMs rather than modelling the counts directly, which will be suboptimal as it doesn't account for the mean-variance relationship of the count data.

Also, your filter doesn't make a lot of sense, because genes that are silent in one condition and expressed in the other condition will be filtered out. This will remove genes with the strongest evidence for DE. Perhaps you should set:

keep <- rowSums(cpm(y) > 0.5) >= 2

Edit: Gordon just pointed it out to me: the residual d.f. thing isn't a problem here, as you've already blocked on the batch factor in the design matrix, so the residual d.f. calculation in lmFit and eBayes will be correct. But while it's not wrong, it is unnecessary, so there's no point in doing it. (Removing the batch effect manually would only make a difference if you did not include the batch factor in the design matrix - in which case, my comment about the residual d.f. applies. So don't do that either.)

ADD COMMENT
1
Entering edit mode
@gordon-smyth
Last seen 5 hours ago
WEHI, Melbourne, Australia

Well, there are a few points that need to be clarified. First, your code is actually using the limma package for differential expression analysis rather than the edgeR package. You are only using edgeR for filtering and normalization (the cpm and calcNormFactors functions). Of course you could switch to edgeR instead if you want (see Aaron's answer), but your limma approach should also be fine.

Second, your experiment doesn't actually have a batch effect. Baseline differences between subjects are already factored into the paired design, so subjects are already expected to be different. The fact that they were sequenced at different times, therefore, does not constitute a batch effect.

There is no need to call removeBatchEffect(), and indeed the help page for removeBatchEffect says that this function should not be used in conjunction with a differential expression analysis. If you had passed the design matrix to removeBatchEffect:

logCPM <- removeBatchEffect(logCPM, batch=subject, design=design)

then removeBatchEffect would have, very sensibly, left logCPM unchanged and given you a warning.

Since you are using removeBatchEffect() to remove the effect of a factor that is already in your design matrix, I think the DE gene list from topTable() will be virtually unchanged when you omit the removeBatchEffect step.

ADD COMMENT
0
Entering edit mode

Dear Aaron, and Gordon, thank you for your comments and help. I will get back to you with a more general piece of code for removing of the batch effects, and will ask for your comments, if you do not mind. many thanks !

ADD REPLY

Login before adding your answer.

Traffic: 424 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6