Possible batch effect correction with RNA-Seq data when the batches do not contain both biological conditions for DE analysis
svlachavas ▴ 830
Last seen 6 months ago
Germany/Heidelberg/German Cancer Resear…

Dear Community,

I would like to ask about a rather challenging scenario of putative batch effect correction for differential expression analysis. Briefly, I have pre-processed (TMM normalization, log2-CPM transformation) a large number of RNA-Seq samples from patients with different solid tumors who underwent conventional therapy prior to sequencing. The main limitation is that, owing to the initial purposes of the study, no control or healthy samples were included. My current goal is to assess whether differential expression analysis could be performed against some reference tissue samples.

On this premise, one "naive" approach would be to download gene counts from GTEx for any tissue types that are also represented in the cancer data, and pre-process them in the same way. My crucial questions are therefore the following:

1) Is it possible to apply a batch effect correction approach, such as removeBatchEffect from limma or ComBat? For differential expression analysis, I understand this would be erroneous, because the residual degrees of freedom would be overestimated, inflating the test statistics and distorting the variance estimates. In addition, as neither of the 2 batches contains both biological conditions, would technical variance dominate, so that any correction could not be separated from the biological signal?

2) Alternatively, would blocking through the design matrix, i.e. including the batch information in the linear model, still be problematic, because all samples from either condition would be fully "used up" in estimating the batch effect and could not contribute to DE?

Is there any alternative solution, or does this experimental design rule out DE analysis altogether?

Thank you in advance,


batch effect limma rna seq DE • 3.1k views
Last seen 13 months ago
United States

If you have absolutely zero normal/control samples in the original study, it seems like you are almost certainly hosed since there's no real way to identify differences in batch vs differences in the biological condition of interest.
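To make the identifiability problem concrete, here is a minimal sketch in Python with NumPy (not limma or edgeR). The six-sample layouts and the `design_matrix` helper are invented for illustration. When every tumor comes from one batch and every normal from the other, the condition and batch columns of the design matrix are identical, so the matrix loses a rank and the two effects cannot be estimated separately. Even a single normal sample processed alongside the tumors restores full rank in this toy layout.

```python
import numpy as np

def design_matrix(condition, batch):
    """Hypothetical helper: columns are intercept, condition, batch."""
    condition = np.asarray(condition, dtype=float)
    batch = np.asarray(batch, dtype=float)
    return np.column_stack([np.ones_like(condition), condition, batch])

# Confounded layout: all tumors (condition=1) in batch 1,
# all normals (condition=0) in batch 0.
confounded = design_matrix(condition=[1, 1, 1, 0, 0, 0],
                           batch=[1, 1, 1, 0, 0, 0])

# Anchored layout (assumed for illustration): one normal sample
# was processed together with the tumors in batch 1.
anchored = design_matrix(condition=[1, 1, 1, 0, 0, 0],
                         batch=[1, 1, 1, 1, 0, 0])

confounded_rank = np.linalg.matrix_rank(confounded)
anchored_rank = np.linalg.matrix_rank(anchored)

# A design is estimable only if its rank equals its number of columns.
print("confounded:", confounded_rank, "of", confounded.shape[1], "columns")
print("anchored:  ", anchored_rank, "of", anchored.shape[1], "columns")
```

Full rank only means the model can be fitted; with a single shared sample the batch estimate would still be very noisy in practice.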

There are almost certainly more examples in the literature of combining different datasets for a single analysis, but here are two that I recall that combined GTEx "normals" with the TCGA data to perform a tumor-vs-normal analysis. I imagine both studies used the matched normals in the TCGA data to integrate the GTEx data for that tissue, but maybe you can find something useful:

  1. Unifying cancer and normal RNA sequencing data from different sources
  2. Comprehensive analysis of normal adjacent to tumor transcriptomes

Dear Steve,

Thank you for your valuable comments and feedback! I will take a detailed look at the papers you mentioned. Actually, I have also found some other examples in the literature, based on the TCGAbiolinks R package, covering various case studies:


and the related publication: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006701

Still, I haven't been able to figure out whether the data merging combined cancer samples from one study with only normal samples from another repository, like GTEx. Nevertheless, perhaps the examples you mention share much more "common ground" for the integration between GTEx and TCGA data.

One last important question on which I would like your opinion:

I managed to extract clinical information about the cancer patients, including the cancer type and localization of each sample (for example brain, pancreas, CRC, etc.). What do you think of an approach that would:

1) Obtain GTEx data only for the same tissues of origin (brain, pancreas, etc.), pre-process them, if possible, with the same normalization and transformation approaches, and finally merge on common gene symbols.

2) Then, as a last step, use limma or edgeR with a paired design matrix, essentially comparing each cancer group from the batch of cancer-only samples with the normal samples from the same tissue. Would that, in your opinion, reduce the batch effect?


I hate to be the bearer of bad news, but if I'm being absolutely honest: if you have no "common ground" samples (as you put it (nicely done)), I'd have little to no hope of any analysis approach giving me data reliable enough to justify spending the mountains of time required to make sense of whatever biology we may have uncovered.

To provide some context: I've seen very different expression profiles generated from the same samples by the same person when these RNA libraries were generated using different protocols (the goal of this exercise was to compare the performance of one library kit vs the other).

When these are then compared against another set of samples, each generated with the matched library prep, the logFCs calculated between the groups become comparable again -- the technical effects largely cancel out and produce something that makes sense and is concordant between the two library preps (as would be expected).

But if we just take absolute expression levels from one library prep on a sample, and compare it to absolute levels of expression generated with another library prep, which is essentially what you are doing, even the very same samples can look worlds apart and that's not reflective of any biological difference at all.
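The cancellation argument can be sketched with a small noise-free simulation in Python/NumPy. Everything here is an assumption made for illustration: the gene count, the effect sizes, and the idea that each library prep adds a fixed gene-specific offset on the log2 scale. Under that additive model, a logFC computed within one prep removes the offset exactly, whereas a logFC computed across preps carries the difference between the two offsets.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes = 1000  # hypothetical number of genes

# Assumed toy model on the log2 scale:
# observed = normal-tissue expression + biological change + prep-specific bias
base = rng.normal(5.0, 2.0, n_genes)    # expression in normal tissue
true_lfc = np.zeros(n_genes)
true_lfc[:100] = 2.0                    # first 100 genes truly change in tumor
prep_a = rng.normal(0.0, 1.5, n_genes)  # gene-specific bias of library prep A
prep_b = rng.normal(0.0, 1.5, n_genes)  # gene-specific bias of library prep B

# Matched comparisons: tumor and normal share the same prep, so the bias cancels.
lfc_prep_a = (base + true_lfc + prep_a) - (base + prep_a)
lfc_prep_b = (base + true_lfc + prep_b) - (base + prep_b)

# Mismatched comparison: tumor from prep A, normal from prep B.
lfc_mismatched = (base + true_lfc + prep_a) - (base + prep_b)

err_matched = np.max(np.abs(lfc_prep_a - true_lfc))
err_mismatched = np.mean(np.abs(lfc_mismatched - true_lfc))

print("largest error, matched preps:", err_matched)
print("mean error, mismatched preps:", err_mismatched)
```

Real data add sampling noise and non-additive effects, so the cancellation is only approximate, but the mismatched comparison stays biased no matter how many samples are added.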


Thanks, Steve, for your consideration on this matter! I cannot argue with that. I hope I can find some normal samples that were processed together, even a limited number.

Na Chen • 0
Last seen 4.0 years ago

Hi, I've been analyzing my RNA-seq data with DESeq2 and ballgown for a while. As far as I know, these tools recommend against removing the batch effect from the data before DE analysis; instead, you can include batch as a covariate in the model. This is my personal opinion, for reference only.
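As a rough illustration of the batch-as-covariate idea, here is a sketch with ordinary least squares in NumPy. It does not use the DESeq2 or ballgown model, and the effect sizes and the `fit` helper are made up for the example. With a balanced layout, where each batch contains both conditions, the condition coefficient is recovered. With the fully confounded layout from the original question, only the sum of the condition and batch effects is determined.

```python
import numpy as np

# Assumed values for one gene on the log2 scale (illustration only).
baseline, true_effect, batch_shift = 5.0, 2.0, 3.0

def fit(condition, batch):
    """Hypothetical helper: fit y ~ intercept + condition + batch by least squares."""
    condition = np.asarray(condition, dtype=float)
    batch = np.asarray(batch, dtype=float)
    y = baseline + true_effect * condition + batch_shift * batch
    X = np.column_stack([np.ones_like(condition), condition, batch])
    coef, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)
    return coef, rank

# Balanced: both conditions are present in both batches.
balanced_coef, balanced_rank = fit(condition=[0, 1, 0, 1], batch=[0, 0, 1, 1])

# Confounded: condition and batch are the same column.
confounded_coef, confounded_rank = fit(condition=[0, 0, 1, 1], batch=[0, 0, 1, 1])

print("balanced coefficients (intercept, condition, batch):", balanced_coef)
print("confounded rank:", confounded_rank, "of 3 columns")
print("confounded condition + batch:", confounded_coef[1] + confounded_coef[2])
```

In the confounded case the solver still returns numbers, but the split between condition and batch is arbitrary, so the covariate approach needs at least some overlap between batches and conditions.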

