I need to remove the batch effect between two RNA-seq datasets and obtain corrected expression profiles for downstream analysis such as clustering. One dataset is from TCGA; the other is available only as FPKM/TPM values. However, the current tools for bulk RNA-seq data are count-based (ComBat_seq, svaseq and RUVSeq), while other tools, such as removeBatchEffect, ComBat and sva, were designed for microarray data. Is there any way to solve my problem? And if I also have microarray data, can I remove the batch effect between the RNA-seq data and the microarray data?
This has been asked many times before; you may want to browse Biostars and this forum, plus Google, for answers. Generally, you cannot just collect random samples from the internet and expect to combine them meaningfully, especially if they come from completely different labs and batch is confounded with condition (or cell type, or whatever your group information is). If you start from raw RNA-seq counts (and the experimental design is not confounded), people often use ComBat-Seq from sva, or removeBatchEffect from limma on normalized counts on the log scale. You most likely cannot combine RNA-seq and microarrays directly at the count/intensity level; these are completely different technologies with unique characteristics. Some kind of rank-based meta-analysis might serve you better, again assuming the comparison is not confounded by technology, which it most likely is.
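To make the confounding point concrete, here is a minimal Python sketch of the linear-model idea behind this kind of correction: fit condition and batch terms per gene on the log scale, then subtract only the batch term. It is an illustration under stated assumptions, not limma's or sva's implementation. The function name `remove_batch_linear` and the toy data are made up for this example, and it assumes one batch factor and one condition factor.

```python
import numpy as np

def _indicators(labels):
    """Treatment-coded indicator columns, first level used as reference."""
    labels = np.asarray(labels)
    levels = np.unique(labels)
    cols = [(labels == lv).astype(float) for lv in levels[1:]]
    return np.column_stack(cols) if cols else np.empty((labels.shape[0], 0))

def remove_batch_linear(log_expr, batch, condition):
    """Hypothetical helper: subtract the fitted batch term from each gene.

    log_expr  : genes x samples array of log-scale expression
    batch     : per-sample batch labels
    condition : per-sample biological group labels
    """
    log_expr = np.asarray(log_expr, dtype=float)
    n = log_expr.shape[1]
    # Design: intercept + condition; batch columns are centered so the
    # per-gene overall level is kept when the batch term is removed.
    X = np.column_stack([np.ones(n), _indicators(condition)])
    B = _indicators(batch)
    B = B - B.mean(axis=0, keepdims=True)
    full = np.column_stack([X, B])
    # If batch and condition are confounded, the design is rank deficient
    # and the batch effect cannot be separated from the biology.
    if np.linalg.matrix_rank(full) < full.shape[1]:
        raise ValueError("batch is confounded with condition")
    beta, *_ = np.linalg.lstsq(full, log_expr.T, rcond=None)
    batch_beta = beta[X.shape[1]:]          # batch coefficients, per gene
    return log_expr - (B @ batch_beta).T

# Toy data: one gene, condition effect of 2, batch shift of 2.
expr = np.array([[1.0, 3.0, 3.0, 5.0]])
batch = ["b1", "b1", "b2", "b2"]
corrected = remove_batch_linear(expr, batch, ["a", "b", "a", "b"])
```

With a balanced design the batch shift is removed and the condition difference survives. If every sample of one condition sits in one batch, the function raises an error instead of returning a correction, which is the situation described above.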
I have browsed many answers before, but most of them did not address whether log(FPKM/RPKM) values can be used directly as input for batch effect removal tools. The samples in the two datasets are of the same tumor type, and I just want to remove the batch effect arising from the data sources. In fact, I tried this by taking log(TPM), converted from FPKM, as input for ComBat, and it seemed to work: the samples no longer clustered by dataset. But I am unsure whether the result is reliable, since ComBat was designed for microarray data.
And you are right that comparing RNA-seq and microarrays directly is unreasonable.
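For reference, the FPKM-to-log(TPM) step mentioned above can be sketched as follows. Within a sample, TPM is FPKM rescaled so that the values sum to one million. The pseudocount of 1 and the log base 2 are assumptions for this example (the post does not say which were used), and `fpkm_to_log_tpm` is a hypothetical helper name.

```python
import numpy as np

def fpkm_to_log_tpm(fpkm, pseudocount=1.0):
    """Convert a genes x samples FPKM matrix to log2(TPM + pseudocount).

    Assumes every sample (column) has a positive FPKM total; a column
    summing to zero would cause a division by zero here.
    """
    fpkm = np.asarray(fpkm, dtype=float)
    col_sums = fpkm.sum(axis=0, keepdims=True)
    tpm = fpkm / col_sums * 1e6      # each column now sums to 1e6
    return np.log2(tpm + pseudocount)

# Toy matrix: 3 genes x 2 samples; sample 2 is sample 1 scaled by 0.5.
fpkm = np.array([[10.0, 5.0],
                 [30.0, 15.0],
                 [60.0, 30.0]])
log_tpm = fpkm_to_log_tpm(fpkm)
```

Because TPM is a within-sample rescaling, the two toy samples end up identical after conversion. This only puts samples on a common scale; it does not address batch differences between datasets.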
The issue has nothing to do with what mathematical manipulations you might have subjected your counts to. It's a fundamental fact of current RNA-seq library preps: they are strongly affected by batch effects. And you can't just remove a batch effect like you'd pick the pepperoni off a pizza.
Just because ComBat gave you a result that superficially looks the way you want it to does not at all guarantee that its manipulations are valid.
Thanks.