I have RNA sequencing data and would like to test for differential gene expression caused by a certain treatment or condition while correcting for batch effects. However, I am not sure how to set up a proper design formula to do this.
My data looks like this (simplified):
   batch  treatment
1  a      control
2  b      treated
3  c      control
4  c      treated
Except that in my actual data I have between 15 and 19 replicates for each of these 4 groups.
Now, if all of these were processed in different batches, I would use the following design:
~ batch + treatment
However, in my case, I think that there should be a better way to do this. If I look at the differentially expressed genes between groups 3 and 4, there should be no batch effect to correct for, since both were processed in batch c. If I look at the differential expression between groups 1 and 2, there is a batch effect on top of the actual treatment effect that I am looking for. I think I should be able to look at what does and what does not overlap between these two comparisons, and use that to better define my batch effect.
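To make the layout concrete, here is a minimal sketch in base R of the design matrix that `~ batch + treatment` would produce for the simplified table. The data frame name `coldata`, the one-row-per-group simplification, and the choice of `control` as the reference level are assumptions for illustration; the real data would have 15 to 19 rows per group.

```r
# Simplified layout: one row per group (assumption; the real data has replicates).
coldata <- data.frame(
  batch     = factor(c("a", "b", "c", "c")),
  treatment = factor(c("control", "treated", "control", "treated"),
                     levels = c("control", "treated"))
)

# Design matrix for the additive model.
mm <- model.matrix(~ batch + treatment, data = coldata)
print(mm)

# If the rank equals the number of columns, every coefficient is estimable.
rank_additive <- qr(mm)$rank
cat("columns:", ncol(mm), "rank:", rank_additive, "\n")  # here: 4 columns, rank 4
```

With this layout the additive model has four coefficients for four groups, so the treatment coefficient is determined by the within-batch comparison (group 4 versus group 3), and the batch b coefficient absorbs whatever is left of the group 2 versus group 1 difference.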
I have been looking through other people's experimental designs in DESeq and think that I should take the interaction between the variables into account in my design formula, but I am not sure how to set this up properly. Does anyone have some insight into this?
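As a quick check of whether an interaction term is even estimable with this layout, the following base R sketch builds the design matrix for `~ batch * treatment` on the simplified table. The names `coldata` and `mm_int` and the one-row-per-group simplification are assumptions for illustration, not part of the actual analysis.

```r
# Simplified layout: one row per group (assumption; the real data has replicates).
coldata <- data.frame(
  batch     = factor(c("a", "b", "c", "c")),
  treatment = factor(c("control", "treated", "control", "treated"),
                     levels = c("control", "treated"))
)

# Design matrix with main effects plus the batch:treatment interaction.
mm_int <- model.matrix(~ batch * treatment, data = coldata)
print(colnames(mm_int))

# A rank lower than the number of columns means some coefficients
# cannot be estimated from this combination of batches and treatments.
rank_int <- qr(mm_int)$rank
cat("columns:", ncol(mm_int), "rank:", rank_int, "\n")  # here: 6 columns, rank 4
```

Because batches a and b each contain only one treatment level, the interaction columns are not independent of the others here, so a design like this would likely be rejected as not full rank. Adding replicates does not change this, since they only repeat the same four rows.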