Dear Community,
I would like to ask about a rather challenging scenario of putative batch effect correction for differential expression analysis. Very briefly, I have pre-processed (TMM normalization, log2-CPM transformation) a large number of RNA-Seq samples from patients with different solid tumors who underwent conventional therapy prior to sequencing. The main limitation is that, because of the initial purposes of the study, no control or healthy samples were included. My current goal is to assess whether differential expression analysis could be performed against some reference tissue samples.
On this premise, one "naive" approach would be to download gene counts from GTEx for the tissue types that are also represented in the cancer data, and pre-process them in the same way. My crucial questions are therefore the following:
1) Is it possible to apply any batch effect correction approach, such as removeBatchEffect from limma or even ComBat? I suspect that for differential expression analysis this would be erroneous, because the residual degrees of freedom would be overestimated, which inflates the test statistics and distorts the variance estimates. Additionally, as neither of the 2 batches contains both biological conditions, would the technical variance not dominate, so that any correction would also remove the biological signal?
2) Alternatively, would blocking through the design matrix, i.e. including the batch information in the linear model, still be problematic, because all the samples from either condition would be entirely "used up" to estimate the batch effect and could not be used for DE?
Is there any alternative solution, or does this experimental design mean that no DE analysis can be performed?
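To make the concern in point 2 concrete, here is a minimal sketch in Python (standard library only). The sample layouts and the `matrix_rank` and `design` helpers are hypothetical illustrations, not part of limma or edgeR. When every tumor sits in one batch and every normal in the other, the condition and batch columns of the design matrix are identical, so the matrix is rank deficient and the two effects cannot be estimated separately.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a small matrix via Gaussian elimination with exact fractions."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Look for a non-zero pivot at or below the current pivot row.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from every other row.
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def design(samples):
    """Columns: intercept, condition (1 = tumor), batch (1 = cancer study)."""
    return [
        [1, int(cond == "tumor"), int(batch == "cancer_study")]
        for cond, batch in samples
    ]


# Hypothetical layout from the question: tumors only in the cancer study,
# normals only in GTEx.
confounded = [("tumor", "cancer_study")] * 3 + [("normal", "gtex")] * 3

# Hypothetical layout with a few normals processed inside the cancer study.
with_shared = confounded + [("normal", "cancer_study")] * 2

rank_confounded = matrix_rank(design(confounded))
rank_shared = matrix_rank(design(with_shared))

print("columns: 3, rank (confounded):", rank_confounded)
print("columns: 3, rank (with shared normals):", rank_shared)
```

In the confounded layout the rank is lower than the number of columns, which is the situation where a linear model cannot return separate condition and batch coefficients. Adding even a few normals to the cancer batch restores full rank in this toy example.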
Thank you in advance,
Efstathios
Dear Steve,
thank you for your valuable comments and feedback! I will take a detailed look at the papers you mentioned. Actually, I have also found some other examples in the literature, based on the TCGAbiolinks R package, covering various case studies:
https://github.com/ELELAB/TCGAbiolinks_examples
and the related publication: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006701
but I still haven't been able to figure out whether the data merging there combined cancer samples from one study with only normal samples from another data repository, like GTEx. Nevertheless, perhaps the example you mention shares much more "common ground" for the integration between GTEx and TCGA data.
My last important question, on which I would like your opinion:
I managed to extract clinical information about the cancer patients, including the cancer type and localization of each cancer sample (for example brain, pancreas, CRC, etc.). Do you think an approach along the following lines would work:
1) Obtain GTEx data only for the same tissues of origin, like brain, pancreas, etc. Pre-process them, if possible, using the same normalization and transformation approaches. As a final step, merge the two datasets on common gene symbols.
2) Then, as a last step, use limma or edgeR with a paired design matrix, essentially comparing each cancer group from the batch of cancer-only samples with the normal samples from the same tissue. Would that, in your opinion, reduce the batch effect?
I hate to be the bearer of bad news, but if I'm being absolutely honest: if you have no "common ground" samples (as you put it (nicely done)), I'd have little to no hope of any analysis approach giving me data reliable enough to make me want to spend the mountains of time required to try and make sense of whatever biology we may have uncovered.
To provide some context: I've seen very different expression profiles generated from the same samples by the same person when these RNA libraries were generated using different protocols (the goal of this exercise was to compare the performance of one library kit vs the other).
When these are then compared against another set of samples, each generated with the matched library prep, the logFCs calculated between the groups become comparable again: the technical effects largely cancel out and produce something that makes sense and is concordant between the two library preps (as would be expected).
But if we just take absolute expression levels from one library prep on a sample, and compare it to absolute levels of expression generated with another library prep, which is essentially what you are doing, even the very same samples can look worlds apart and that's not reflective of any biological difference at all.
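A toy sketch of this cancellation, in Python with made-up numbers: it assumes, purely for illustration, that each library prep adds a fixed gene-specific bias on the log2 scale. Real protocol effects are messier, and the gene names, groups, kits and values below are all hypothetical.

```python
# Assumed toy model: measured log2 expression = biological level
# + a gene-specific bias that depends on the library prep kit.
genes = ["geneA", "geneB", "geneC"]

biology = {
    "group1": {"geneA": 5.0, "geneB": 7.0, "geneC": 3.0},
    "group2": {"geneA": 6.0, "geneB": 7.0, "geneC": 1.0},
}

bias = {
    "kit1": {"geneA": 0.0, "geneB": 1.5, "geneC": -2.0},
    "kit2": {"geneA": 2.0, "geneB": -1.0, "geneC": 0.5},
}


def measured(group, kit):
    """Log2 expression of one group as seen through one library prep."""
    return {g: biology[group][g] + bias[kit][g] for g in genes}


def log_fc(a, b):
    """Per-gene log2 fold change of a over b."""
    return {g: a[g] - b[g] for g in genes}


true_fc = log_fc(biology["group2"], biology["group1"])

# Both groups prepared with the same kit: the bias cancels in the difference.
fc_kit1 = log_fc(measured("group2", "kit1"), measured("group1", "kit1"))
fc_kit2 = log_fc(measured("group2", "kit2"), measured("group1", "kit2"))

# The very same group measured with two kits: pure technical difference.
same_sample_across = log_fc(measured("group1", "kit2"), measured("group1", "kit1"))

# Groups and kits fully confounded, as in a tumor-vs-GTEx comparison.
mixed_fc = log_fc(measured("group2", "kit2"), measured("group1", "kit1"))

print("true logFC:          ", true_fc)
print("within kit1:         ", fc_kit1)
print("within kit2:         ", fc_kit2)
print("same sample, 2 kits: ", same_sample_across)
print("confounded logFC:    ", mixed_fc)
```

Under this assumption the within-kit fold changes match the true ones, while the confounded comparison mixes the biological difference with the bias difference, and nothing in the data alone tells the two apart.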
Thanks Steve for your consideration on this matter! I cannot argue with that. I hope I can find some normal samples processed together with the cancer samples, even a limited number.