ComBat to remove batch effects from RNA-Seq data
kerrypop ▴ 30


I want to use ComBat to remove batch effects from RNA-Seq data. I have 2 batches, each consisting of disease and healthy samples, processed at different times. Can I use FPKM-normalized data as input for ComBat, or do I have to perform some normalization starting from the raw counts?

Thank you 

polemiraza ▴ 70

Dear kerrypop,

It depends on your task. If you want to calculate differential expression, it is better to use the raw counts directly and include the batch variable in the model. Please see the edgeR package User's Guide, sections 3.4.3 and 4.2.
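
As a minimal sketch of what "including the batch variable" can look like, the design matrix below adds batch as a covariate next to the condition of interest. The sample layout and the names `batch`, `condition` and `design` are made up for illustration. Only base R is used here; a matrix like this is what you would then pass to the edgeR model-fitting functions described in the User's Guide.

```r
# Hypothetical layout: 2 batches, each with healthy and disease samples
batch     <- factor(c("b1", "b1", "b1", "b1", "b2", "b2", "b2", "b2"))
condition <- factor(c("healthy", "healthy", "disease", "disease",
                      "healthy", "healthy", "disease", "disease"),
                    levels = c("healthy", "disease"))

# Additive model: batch is adjusted for, condition is the effect of interest
design <- model.matrix(~ batch + condition)
colnames(design)
```

With this parameterization the last column is the disease vs healthy effect after adjusting for batch, so the batch effect is modelled rather than removed from the counts.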

If you want to perform, for example, unsupervised clustering or build a network, then yes, it is better to normalize the data before batch correction. I know of 4 options that should bring your data closer to a normal distribution and reduce skewness (which is essential for ComBat):

1) log2(FPKM + a small offset such as 1)

2) rlog transformation (use raw counts as input / DESeq2 package)

3) vst transformation (use raw counts as input / DESeq2 package)

4) voom transformation (use raw counts as input / limma package)
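
To illustrate option 1, here is a small base R sketch on simulated data. The matrix `fpkm`, its dimensions and the helper `skewness` are assumptions made up for this example, not real data or a package function; the point is only that the log transform with an offset keeps zeros finite and pulls in the long right tail.

```r
set.seed(1)
# Simulated, hypothetical FPKM matrix: 200 genes x 6 samples, right-skewed
fpkm <- matrix(rlnorm(200 * 6, meanlog = 2, sdlog = 1.5), nrow = 200,
               dimnames = list(paste0("gene", 1:200), paste0("s", 1:6)))

# Option 1: log2 with an offset of 1 so that zeros stay finite
log_fpkm <- log2(fpkm + 1)

# Simple moment-based skewness to compare before and after the transform
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3
c(raw = skewness(fpkm), logged = skewness(log_fpkm))
```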

I have experimented with all of them (e.g. ComBat applied before unsupervised clustering). Methods 2, 3 and 4 gave me the most satisfying results. I can't say which one is best, but if you have dozens or hundreds of samples, method 2 is very slow and method 3 is only a little faster. Recently voom surprised me with its speed (a few minutes compared to hours), and the clustering results were similar to those obtained with vst across various datasets.
So I highly recommend the voom transformation before ComBat.
Please remember to remove lowly expressed genes before running voom (this also applies to rlog, vst and ComBat in general), and don't be alarmed by negative values in voom-transformed data; they are expected.
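
A rough sketch of that workflow (filter, voom, then ComBat) is below. It assumes the edgeR, limma and sva packages are installed, and it runs on simulated counts; the objects `counts`, `batch`, `condition`, `mod` and `corrected` are hypothetical names for this example. Using `filterByExpr` for the filtering step is my choice here, not the only option.

```r
library(edgeR)  # DGEList, filterByExpr, calcNormFactors
library(limma)  # voom
library(sva)    # ComBat

set.seed(1)
# Simulated, hypothetical raw counts: 1000 genes x 8 samples
counts <- matrix(rnbinom(1000 * 8, mu = 50, size = 5), nrow = 1000,
                 dimnames = list(paste0("gene", 1:1000), paste0("s", 1:8)))
batch     <- factor(rep(c("b1", "b2"), each = 4))
condition <- factor(rep(c("healthy", "disease"), times = 4))

# Drop lowly expressed genes, then compute normalization factors
y    <- DGEList(counts = counts)
keep <- filterByExpr(y, group = condition)
y    <- y[keep, , keep.lib.sizes = FALSE]
y    <- calcNormFactors(y)

# voom returns log2-CPM values in v$E; these can be negative for low counts
mod <- model.matrix(~ condition)
v   <- voom(y, design = mod)

# Remove the batch effect while protecting the condition effect via mod
corrected <- ComBat(dat = v$E, batch = batch, mod = mod)
dim(corrected)
```

The `corrected` matrix is meant for clustering or network building only. For differential expression, keep the raw counts and put batch in the design instead.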







Sorry for the delay in my response. I was traveling the past 2 weeks. I greatly appreciate your feedback and am eager to try the different methods.

I previously used the following code to remove lowly expressed genes and wanted to check if it was an appropriate method. 

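# Variance of each gene (row) across samples
getVar <- apply(FPKM, 1, var)
# Variance threshold
param <- 1
# Keep genes with variance above the threshold (the NA check is an assumed completion of the truncated condition)
data_NoVar <- FPKM[getVar > param & !is.na(getVar), ]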

Thank you again!


