I have RNA-seq data from three treatments (A, B, C), each with biological replicates, run in two batches (batch1 with A & B; batch2 with C). Unfortunately, batch is fully confounded with condition for some comparisons (e.g., A vs. C). I understand that it is impossible to separate the batch effect from the treatment effect here, regardless of which statistical method I use. I considered methods such as ComBat and removeBatchEffect (limma), but I cannot estimate covariates to include in the batch correction. I then thought of using control-gene-based methods such as RUVg; however, I do not have ERCC control genes in the data, nor an independent data set of similar treatments from which to obtain negative (or positive) control genes. I understand that the best option would be to redo the experiment with a good design. That said, I am curious whether anyone has suggestions or options I can explore so that I can still use the data in some way. I appreciate your help. Thank you.
Perform the analysis as usual and validate the key findings with independent experiments to be sure that the major conclusions you draw are biology-driven rather than batch-driven.
If you want to analyze the data, you will have to assume that the biological differences you want to detect are 'larger' than the technical batch effects. Whether that holds usually has more to do with how the batches came about than with the underlying biology. For example, if you prepared all the total RNA with the same batch of reagents, just days apart, and then sent it off to be sequenced, there is a good chance that any technical batch effects will be pretty minor. In that case you could probably make a compelling argument that it's fine.
However, if these batches were processed at different times, by different people, using different reagents, and then sequenced at very different times as well, I would be shocked if the batch effects didn't dominate. Either way, there is no way to determine from these data alone whether a difference is batch or biology. If you are feeling lucky, and have the ability to get new data to validate, then doing what ATpoint says is an option. But that depends on how easy/cheap it is to get new data. If that's not really an option, then IMO doing the analysis is essentially the same as not doing it: unless you can validate, the results you get are about as informative as not doing the analysis at all (i.e., you don't really know anything either way).
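To make the confounding concrete, here is a minimal sketch in Python/NumPy. The layout is an assumption for illustration only (three replicates per treatment; the actual replicate counts are not given in the question), and `design_matrix` is a hypothetical helper, not part of any RNA-seq package. It builds the design matrix for a model with both treatment and batch and checks its rank.

```python
import numpy as np

# Assumed layout (illustrative): 3 replicates per treatment,
# A and B sequenced in batch1, C sequenced in batch2.
treatment = ["A"] * 3 + ["B"] * 3 + ["C"] * 3
batch = ["batch1"] * 6 + ["batch2"] * 3


def design_matrix(treatment, batch):
    """Hypothetical helper: intercept, treatment dummies (A is the
    reference level) and a batch dummy (batch1 is the reference level)."""
    rows = [
        [1.0, float(t == "B"), float(t == "C"), float(b == "batch2")]
        for t, b in zip(treatment, batch)
    ]
    return np.array(rows)


X = design_matrix(treatment, batch)
n_coef = X.shape[1]                      # coefficients we would like to estimate
rank = int(np.linalg.matrix_rank(X))     # coefficients the data can support

# The "treatment C" column and the "batch2" column are identical,
# so the model cannot tell a C effect apart from a batch2 effect.
c_equals_batch = bool(np.array_equal(X[:, 2], X[:, 3]))

print(f"coefficients: {n_coef}, rank: {rank}")
print(f"C column identical to batch2 column: {c_equals_batch}")
```

The rank comes out one short of the number of coefficients, which is why tools that need a full-rank design cannot fit treatment and batch together here. Note that the B column is not affected: A and B sit in the same batch, so the A vs. B comparison does not suffer from this particular confounding.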
Thank you for your response and suggestions. The batches were processed at different times, but followed the same library prep, reagents and sequencing protocol. It looks like I don't have much choice other than generating new data and validating the findings.
Thank you for your response.