I have seen many posts about batch effect correction on RNA-seq data. But I'm little confused in using the code for differential analysis. So, asking it here.
I have been using 8 rna-seq samples, 4 NOTCH silenced samples and 4 Controls. The setup is like this
Samples type days
1 Control day1
2 Control day1
3 Control day2
4 Control day2
5 shNotch day2
6 shNotch day2
7 shNotch day1
8 shNotch day1
For my data I did the batch correction and then used the data for an MDS plot to look how they clustered.
library(edgeR)
group <- factor(paste0(columndata$type))
dge <- DGEList(counts=countdata, group=group)
keep <- rowSums(cpm(dge) > 1) >= 3
dge <- dge[keep, , keep.lib.sizes=FALSE]
dge <- calcNormFactors(dge, method = "TMM")
logCPM <- cpm(dge, log=TRUE, prior.count=2)
logCPM_bc <- removeBatchEffect(logCPM, batch=columndata$days)
plotMDS(logCPM_bc)
For differential analysis first I created design matrix like following:
design <- model.matrix(~ 0 + group + columndata$days)
colnames(design) <- c("Control","shNotch", "Days")
contrast.matrix <- makeContrasts(shNotch-Control, levels=design)
And now I was wondering whether to use DGElist object dge
or logCPM_bc
(Batch corrected logCPM) for estimating dispersion estimateDisp
, for glmQLFit
.
Any help? thanks a lot
removeBatchEffect
should not be used for DE analysis but for other downstream analysis, such as PCA. For adjustment of DE analysis, indeed you should includecolumndata$days
in design matrix. Therefore, yourdge
object should be used. This is important because the linear modeling steps should know that it uses one degree of freedom for the adjustment.Yes I understand now. Thanks a lot Mikhael.