We have ~60 RNA-seq samples spread over three known batches, and the "condition" is not evenly distributed across those batches. For example:
batch A = all tumor subtype 1
batch B = all tumor subtype 2
batch C = a mix of subtype 1 and subtype 2
For this reason, I suspect batch correction may not be possible, because batch and subtype are perfectly confounded in batches A and B. Assuming DESeq2 cannot use the mixed batch C as an intermediary to handle all three batches at once, it seems I would need to split batch C into subtype 1 and subtype 2 and do two separate batch corrections:
batch A + batch C subtype 1
batch B + batch C subtype 2
This would make the subtypes impossible to compare, but if that's the best I can do, so be it.
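As a sanity check on the confounding itself, here is a minimal sketch (plain Python, not DESeq2) that builds a treatment-coded design matrix for ~condition+batch and tests whether it is full rank. The sample layout, the labels "s1"/"s2", and the helper names are all hypothetical stand-ins for my real sample sheet. Full rank only means the coefficients are estimable in principle; it says nothing about how well they are estimated.

```python
from fractions import Fraction


def design_matrix(samples):
    """Rows for ~ condition + batch with treatment (dummy) coding.

    `samples` is a list of (batch, condition) pairs. The first level of each
    factor in sorted order is used as the reference level.
    """
    conditions = sorted({cond for _, cond in samples})
    batches = sorted({batch for batch, _ in samples})
    rows = []
    for batch, cond in samples:
        row = [1]  # intercept
        row += [int(cond == c) for c in conditions[1:]]
        row += [int(batch == b) for b in batches[1:]]
        rows.append(row)
    return rows


def rank(rows):
    """Matrix rank by exact Gauss-Jordan elimination on fractions."""
    m = [[Fraction(x) for x in r] for r in rows]
    rk = 0
    for col in range(len(m[0])):
        pivot = next((i for i in range(rk, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue  # no pivot in this column
        m[rk], m[pivot] = m[pivot], m[rk]
        for i in range(len(m)):
            if i != rk and m[i][col] != 0:
                factor = m[i][col] / m[rk][col]
                m[i] = [a - factor * b for a, b in zip(m[i], m[rk])]
        rk += 1
    return rk


def is_full_rank(samples):
    rows = design_matrix(samples)
    return rank(rows) == len(rows[0])


# Hypothetical layout: A is all subtype 1, B is all subtype 2, C is mixed.
all_three = [
    ("A", "s1"), ("A", "s1"),
    ("B", "s2"), ("B", "s2"),
    ("C", "s1"), ("C", "s2"), ("C", "s1"), ("C", "s2"),
]
a_and_b_only = [s for s in all_three if s[0] != "C"]

print("A + B + C full rank:", is_full_rank(all_three))
print("A + B only full rank:", is_full_rank(a_and_b_only))
```

In this toy layout the three-batch design comes out full rank, while batches A and B alone do not, so the mixed batch C does appear to break the perfect confounding, at least on paper.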
Secondly, assuming one of the approaches above works, I am not finding much detail on what DESeq2 actually does for default batch correction when design = ~condition+batch is passed to DESeqDataSetFromMatrix and the VST and rlog transformations are run with blind = FALSE. Does this "DESeq batch correction" fully correct for batches, to the point where ComBat or SVA would be unnecessary afterwards? PCA results from this DESeq batch correction plus ComBat seem way too clean, with groups splitting too neatly, which makes me suspect the corrections are being over-biased toward condition. If I want to normalize the data using DESeq2 and then run ComBat or SVA, should I steer clear of all batch information in DESeq2? That is, should I specify design = ~condition for DESeqDataSetFromMatrix and run the VST and rlog transformations with blind = TRUE?
I apologize if this is a repeat question, but I can't seem to find much advice on these possible overlaps between two really useful and widespread tools.
If the batch designations only come into play later in the DE calculations, and not at all in the transformations, it seems that ComBat may be the culprit behind the overly clean PCA. Thanks!