Hi, I have a basic question I have not seen asked yet and I am hoping for some feedback.
I have an RNA-seq dataset in two batches with an imbalanced study design and a large batch effect. I used ComBat-seq to 'correct' for the batch effect.
PCA of the original count matrix showed the large batch effect, with PC1 explaining ~90% of the variance. After ComBat-seq, the batch effect is reduced (PC1 ~37%) but still apparent. I could post images, but unless someone thinks it's necessary I'll leave them out for now.
My question: now that I have a count matrix output from ComBat-seq, I can pass it into DESeq2 for DE analysis. Is it a good idea (or even a valid approach) to also include batch in the design formula, to account for the batch effect I can still see in my EDA of the 'batch corrected' count matrix?
Any insight would be appreciated!

Hi Dr. Love, thank you for your response! I just wanted to say first: thank you for making DESeq2, and for the enormous number of responses to so many questions about it. I'm an immunologist in a PhD program and came to R by necessity, with only a little previous programming experience. I've read tons of your responses, and some I've read and re-read to understand them more deeply. It's been a huge help as I teach myself how to do RNA-seq analysis.
Before ComBat-seq, DESeq2 identified about 95% of all detected genes as differentially expressed with padj far below 0.05, even with batch in the design formula.
I used the ComBat-seq matrix as input to DESeq2 and still added batch to the design formula. The statistician I work with encouraged that because we still see batch effects. In one comparison between a patient group and healthy controls I'm now observing a more "reasonable" number of DE genes, about 1100.
I was not sure how to extract surrogate variables from ComBat-seq, and I'm not sure about the loss of degrees of freedom either, or how to interpret it. I'll put that question to my statistician colleague.
At the moment the approach seems to be working.
Thanks for the kind words.
Re: 95% of genes as DE, this sounds like the wrong design formula is being used, or the wrong reduced design, etc. I don't know any details from your post, but just look at some of the plotCounts for your top DE genes and you can likely get an idea. Sometimes users accidentally test on the batch variable, for example.
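As a rough illustration of that check, here is a minimal sketch in R. It uses simulated data from DESeq2's `makeExampleDESeqDataSet()` rather than the dataset in this thread, and the `batch` labels, the `condition` levels `A`/`B`, and the object names are all assumptions made up for the example. The idea is to put the variable of interest last in the design, name the contrast explicitly, and then look at the counts for the top gene split by both batch and condition.

```r
# Sketch only: simulated data stand in for the real count matrix.
suppressPackageStartupMessages(library(DESeq2))

set.seed(1)
# makeExampleDESeqDataSet() simulates counts with a two-level 'condition' (A, B).
dds <- makeExampleDESeqDataSet(n = 1000, m = 12, betaSD = 1)

# Hypothetical batch labels, crossed with condition (an assumption for illustration).
dds$batch <- factor(rep(c("b1", "b2"), times = 6))

# Batch is a nuisance term; the variable of interest goes last in the design.
design(dds) <- ~ batch + condition
dds <- DESeq(dds, quiet = TRUE)

# List the fitted coefficients to see what can actually be tested.
print(resultsNames(dds))

# Name the contrast explicitly so the test is on condition, not on batch.
res <- results(dds, contrast = c("condition", "B", "A"))
summary(res)

# Normalized counts for the gene with the smallest adjusted p-value,
# returned as a data frame with the batch and condition of each sample.
top_gene <- rownames(res)[which.min(res$padj)]
top_counts <- plotCounts(dds, gene = top_gene,
                         intgroup = c("batch", "condition"),
                         returnData = TRUE)
print(top_counts)
```

If the counts for the top genes separate by batch rather than by condition, that suggests the test is being made on the wrong variable. Dropping `returnData = TRUE` draws the plot instead of returning the data frame.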
OK, I can take a look at that, thank you for the advice. When I saw that result, I assumed the batch effect was just too dramatic for DESeq2 to account for, since I had no other experience to draw on.
I will get into it again and see what I can figure out on my own before asking for help, and if I do, I'll post my code with a specific question. Thank you so much!