Hi All,
I am conducting differential expression analysis on RNA-Seq data from sorted cells vs. unsorted cells from mice. A mutant mouse expressing fluorescent protein in a specific cell population was used to conduct the cell sorting, but we included unsorted cells from WT mice as well (to ensure the FP didn't alter global expression). Thus, there are 3 paired replicates of sorted cells and unsorted (bulk cell prep) for the mutant FP expressing mice, and 3 replicates of unsorted (bulk cell prep) from the WT mice. Sample matrix as follows:
Sample..........SampleID..........Fraction
Mutant Rep 1 Sort..........A..........PosSort
Mutant Rep 2 Sort..........B..........PosSort
Mutant Rep 3 Sort..........C..........PosSort
Mutant Rep 1 Unsorted..........A..........Unsorted
Mutant Rep 2 Unsorted..........B..........Unsorted
Mutant Rep 3 Unsorted..........C..........Unsorted
WT Rep 1 Unsorted..........D..........Unsorted
WT Rep 2 Unsorted..........E..........Unsorted
WT Rep 3 Unsorted..........F..........Unsorted
As expected, the unsorted WT and FP-expressing replicates are virtually indistinguishable in the rlog PCA, while the sorted cells are quite distinct. Sample pairs are indicated by the shapes and blue lines:

I have analyzed the data with DESeq2 with designs either accounting for pairing (SampleID factor) or not, with dramatically different results. Here are the design formula:
One Factor (1F) model, no pairing: ~Fraction
Two Factor (2F) model, control for sample pairing: ~SampleID + Fraction
Looking at the histograms of p values for these two models, I see that the 2F model has a more pronounced pile on the right side. This becomes quite dramatic in the adjusted p values histogram. Meanwhile, the 1F model seems to yield more than double the number of genes on the left side and looks smoother across.

I was quite surprised by this, as I was hoping that controlling for differences in sorted vs. unsorted cells across experimental replicate pairs would strengthen the model. Upon further investigation, it seems that introducing the SampleID factor substantially increased the dispersion estimates for a majority of genes:

Perhaps unsurprisingly based on the dispersion estimates, the 2F model yielding slightly smaller log2FC effect sizes, and higher adjusted p values for the majority of genes (black lines are y=x):

I should add that these analyses included the unsorted WT samples; removing them has essentially no effect on the 2F model with sample pairing, while it does substantially weaken the 1F model.
Analysis..................................................... # Genes pAdj < 0.1
1F model, full dataset (9 samples) .................. 3974
1F model, only Mut (6 samples) ...................... 822
2F model, full dataset (9 samples) .................... 1286
2F model, only Mut (6 samples) ...................... 1276
I have several questions here, as I am trying to understand these results:
1) Are the effect sizes and p values direct consequences of the difference in dispersion estimates, or is something else going on here?
2) Is DESeq2 correcting for differences between sample pairs before dispersion estimation? I assumed the global dispersion estimates (the global dependence of dispersion on mean for the entire experiment) would be the same regardless of design formula, and the SampleID factor would be used to calculate effect size and significance. But clearly the design formula is having a substantial impact on global dispersion estimates. Any resource that I could read to help me understand why this is the case would be greatly appreciated.
3) Am I wrong to interpret the 2F sample pairing model as under-performing here? As the main goal of this analysis is to identify DEGs between sorted and unsorted samples (mostly reflected in PC1 axis), it seems to me that the 1F model is more appropriate based on the p value histograms. It also yields significantly more DEGs. But is it possible that this 1F model is actually underestimating the dispersion and overestimating the significance in comparison to the 2F model?
4) In other datasets with more replicate pairs, I have seen that the sample pairing factor has helped some analyses. Is the presumed penalty in this case due to the low number of replicates? I am wondering if there is some analytical way to determine which data sets will be better suited to which model. It isn't very time consuming to run these analyses, but running all possible models and using the one that yields the most DEGs doesn't seem right.
Thanks for your help and input!

Thanks Michael, the degrees of freedom definitely makes sense. I did not realize a coefficient was estimated for each unpaired WT sample.
I should have clarified better. The mutant is only expressing a fluorescent protein in a subset of cells. I expect this will have very little impact on this subset of cells' expression profile, and virtually no impact on the expression profile of the bulk tissue (unsorted). Indeed, you can see in the PCA that the WT and Mut unsorted samples cluster mostly by the experimental batch; each set of WT unsorted / Mut unsorted / Mut sorted was prepared on a separate day, so there are actually 3 sets of 3 samples (the blue lines in fact correspond to batches).
Biologically, I'm interested in which genes are specifically enriched or depleted in the sorted cells compared to the bulk tissue. To me, both the WT unsorted and Mut unsorted samples provide an equally good representation of the bulk tissue. The reason I wanted to use sample pairing info is because PC2 seems to reflect the sample pairs, which I now realize is more likely an experimental batch effect. Since the unsorted samples cluster moreso by batch than by genotype, maybe the better model would be to use a batch factor instead of sample and use ~batch + fraction?