Question: RNA-seq Batch Correction
0
15 months ago by
cvu0
cvu0 wrote:

Hello All,

I have an RNA-seq dataset and I've performed PCA analysis to check batch effect using package edgeR. and have used raw counts from cuffnorm output.

here is my code

library(edgeR)
dge1 <- DGEList(counts=a1[(-1)])
dge1 <- calcNormFactors(dge1)
logCPM1 <- cpm(dge1,log=TRUE,prior.count=5)
batch=c(rep("batch1",8), rep("batch2",8))
logCPM1 <- removeBatchEffect(logCPM1, batch=batch)
B.res = prcomp(logCPM1, scale = TRUE)
pc.s = summary(B.res)$importance[2,1:2] pc1.var = round(pc.s[["PC1"]],2) pc2.var = round(pc.s[["PC2"]],2) pdf(file = "after1_125.pdf") plot(B.res$rotation[,1:2], main = "After Batch Correction",  xlab = paste("PC1 (variance:",pc1.var*100, "%)", sep =""), ylab = paste("PC2 (variance: ",pc2.var*100,"%)", sep =""), col=resp, pch = 16, cex = .9)
dev.off()

From PCA plot I could see there is little batch effect and I have corrected batch effect. Now I want to use batch corrected data for further analysis in cuffdiff. Is this possible? If yes how? or is Tuxedo Suite takes care of batch correction?

Please suggest me tools or workflows which uses batch corrected data for RNA-seq analysis.

Any help would be appreciated.
Thanks you.

modified 15 months ago by Aaron Lun23k • written 15 months ago by cvu0
1
15 months ago by
Aaron Lun23k
Cambridge, United Kingdom
Aaron Lun23k wrote:

I won't spreak for other analyses you might want to do with RNA-seq data; but for differential expression analyses, you should not use batch-corrected data. This is because the batch correction does not account for the uncertainty of the estimation of the batch effect, which usually results in anticonservativeness (i.e., p-values that are lower than they should be, too many false positives). Rather, you should include the batch as a blocking factor in the statistical model. This can be easily done with GLMs, see Section 3.4.3 of the edgeR user's guide.

Also, edgeR doesn't perform PCA, you're actually using the prcomp function from stats.

Thank you Aaron
I am also inclined to believe that batch corrected data should not be used for differential expression analysis.

If you could share your thoughts on these, that would be great.
=================================================================
Comment:
As far as the anticonservativeness goes, I think the jury is still divided
Example:
This study states that the correction of batch factors
"it did not increase the number of small p values (e.g., p values < 0.05), but reversely, the small p values are getting less."
https://www.researchgate.net/publication/304497610_Evaluation_of_Methods_in_Removing_Batch_Effects_on_RNA-seq_Data
=================================================================
Question:
if we apply "batch as a blocking factor"
Should this blocking factor be applied for the samples of comparing combination?
or
For all samples? and then use the output data for differential expression?
or
am I getting this wrong?

Example:
I have 24 samples, with 3 batches of 8 samples each.
One specific combination, say A_vs_B, has 3 replicates each (6 samples)

Now should I apply the batch-correction using GLM for the combination? and then perform differential expression analysis?
or
Should I fist corrected the data of all 24 samples, and perform the differential analysis for A_vs_B? and so on so forth for all combination
=================================================================
Thanks a lot for your inputs
really appreciate the time

The quote you've taken from that paper seems to be showing what happens before correction/modelling of the batch factors. Obviously, if you simulate data for a balanced design with a batch effect, and then fail to model that batch effect, you will inflate the variance and increase conservativeness, resulting in a skew towards large p-values. This is a different problem to what I was describing, relating to the use of batch-corrected data for DE analysis.

There is no argument that using pre-corrected data will fail to propagate the uncertainty of the batch effect estimates. Failing to account for uncertainty will generally result in anticonservativeness during hypothesis testing. (I cannot think of a specific case where failing to account for uncertainty would result in conservativeness, but I suppose it might be possible.)

For your second question; I don't understand what you're confused about. Try reading Section 3.4.3 and the case study in Section 4.2 of the edgeR user's guide; I'm not going to recite the documentation here.