We have an experiment where samples were collected and sequenced in several batches (matched ATAC-seq and RNA-seq). We would like to include batch in our design formula if possible. The problem is that, after quality control and removal of substandard samples, some of the batches contain only one sample (several batches have many samples and a few have only one).
We have used DESeq2 for the differential testing (for both ATAC and RNA) and would also like to do further downstream analysis, such as clustering. We have included batch in our design formula for DESeq2. Is this a terrible thing to do? Because the model can now set the batch effect to anything for those single-sample batches, are those samples even adding anything to the analysis?
Secondly, for the downstream analysis we are using rlog from DESeq2 and then passing the results to removeBatchEffect from limma. When we do this, we basically get 0s in every row for the samples that are in a batch on their own. (This makes sense: with only one sample in a batch, the linear model can always find a beta value that accounts for all of that sample's variation.)
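To illustrate what we think is happening, here is a minimal sketch in plain Python. It is not limma's implementation: it assumes the simplest possible model (one gene, batch as the only term, no condition covariate), and the function name `remove_batch_means` and the numbers are made up for the example. Under that assumption the fitted value for each sample is just its batch mean, so a sample that is alone in its batch is fitted perfectly and nothing is left after correction.

```python
from collections import defaultdict

def remove_batch_means(values, batches):
    """Subtract each batch's mean from its samples (one gene, batch-only model)."""
    grouped = defaultdict(list)
    for value, batch in zip(values, batches):
        grouped[batch].append(value)
    # The fitted batch effect is simply the mean of the samples in that batch.
    batch_mean = {b: sum(v) / len(v) for b, v in grouped.items()}
    return [value - batch_mean[batch] for value, batch in zip(values, batches)]

# Hypothetical rlog-like values for one gene across five samples.
values = [5.0, 7.0, 6.0, 8.0, 9.5]
batches = ["A", "A", "B", "B", "C"]  # batch C holds a single sample

corrected = remove_batch_means(values, batches)
print(corrected)  # the last entry (batch C) is exactly 0.0
```

Samples in batches A and B keep their within-batch differences, while the lone sample in batch C carries no information once its batch term has been removed.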
Some solutions we have considered:
* Pool all the samples that are the only sample in their batch into a pooled batch.
* Do the differential analysis using all samples, but only use samples from multi-sample batches in the downstream analysis.
* Remove the samples from single-sample batches from the analysis. We'd rather not do this, because it would severely reduce the number of samples in the analysis.
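For the first option, the relabelling we have in mind could be sketched as follows. This is plain Python, not a DESeq2 or limma call; the function name `pool_singleton_batches`, the label `"pooled"`, and the batch names are all hypothetical. Every batch that contains exactly one sample is renamed to a shared pooled label, and the resulting vector would then be used as the batch factor.

```python
from collections import Counter

def pool_singleton_batches(batches, pooled_label="pooled"):
    """Relabel every batch that has exactly one sample with a shared label."""
    counts = Counter(batches)
    return [b if counts[b] > 1 else pooled_label for b in batches]

# Hypothetical batch labels for seven samples; B, C and E are singletons.
batches = ["A", "A", "B", "C", "D", "D", "E"]
pooled = pool_singleton_batches(batches)
print(pooled)  # ['A', 'A', 'pooled', 'pooled', 'D', 'D', 'pooled']
```

We realise this treats unrelated batches as if they shared a single batch effect, which may not hold, and that if only one singleton batch exists the pooled batch still contains a single sample.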