Question

Sample Pairing Design Effects on Dispersion Estimates in DESeq2?

0

Entering edit mode

SeqGoblin3 • 0

@seqgoblin3-22010

Last seen 5.9 years ago

New York, USA

Hi All,

I am conducting differential expression analysis on RNA-Seq data from sorted cells vs. unsorted cells from mice. A mutant mouse expressing fluorescent protein in a specific cell population was used to conduct the cell sorting, but we included unsorted cells from WT mice as well (to ensure the FP didn't alter global expression). Thus, there are 3 paired replicates of sorted cells and unsorted (bulk cell prep) for the mutant FP expressing mice, and 3 replicates of unsorted (bulk cell prep) from the WT mice. Sample matrix as follows:

Sample..........SampleID..........Fraction

Mutant Rep 1 Sort..........A..........PosSort

Mutant Rep 2 Sort..........B..........PosSort

Mutant Rep 3 Sort..........C..........PosSort

Mutant Rep 1 Unsorted..........A..........Unsorted

Mutant Rep 2 Unsorted..........B..........Unsorted

Mutant Rep 3 Unsorted..........C..........Unsorted

WT Rep 1 Unsorted..........D..........Unsorted

WT Rep 2 Unsorted..........E..........Unsorted

WT Rep 3 Unsorted..........F..........Unsorted

As expected, the unsorted WT and FP-expressing replicates are virtually indistinguishable in the rlog PCA, while the sorted cells are quite distinct. Sample pairs are indicated by the shapes and blue lines:

I have analyzed the data with DESeq2 with designs either accounting for pairing (SampleID factor) or not, with dramatically different results. Here are the design formula:

One Factor (1F) model, no pairing: ~Fraction

Two Factor (2F) model, control for sample pairing: ~SampleID + Fraction

Looking at the histograms of p values for these two models, I see that the 2F model has a more pronounced pile on the right side. This becomes quite dramatic in the adjusted p values histogram. Meanwhile, the 1F model seems to yield more than double the number of genes on the left side and looks smoother across.

I was quite surprised by this, as I was hoping that controlling for differences in sorted vs. unsorted cells across experimental replicate pairs would strengthen the model. Upon further investigation, it seems that introducing the SampleID factor substantially increased the dispersion estimates for a majority of genes:

Perhaps unsurprisingly based on the dispersion estimates, the 2F model yielding slightly smaller log2FC effect sizes, and higher adjusted p values for the majority of genes (black lines are y=x):

I should add that these analyses included the unsorted WT samples; removing them has essentially no effect on the 2F model with sample pairing, while it does substantially weaken the 1F model.

Analysis..................................................... # Genes pAdj < 0.1

1F model, full dataset (9 samples) .................. 3974

1F model, only Mut (6 samples) ...................... 822

2F model, full dataset (9 samples) .................... 1286

2F model, only Mut (6 samples) ...................... 1276

I have several questions here, as I am trying to understand these results:

1) Are the effect sizes and p values direct consequences of the difference in dispersion estimates, or is something else going on here?

2) Is DESeq2 correcting for differences between sample pairs before dispersion estimation? I assumed the global dispersion estimates (the global dependence of dispersion on mean for the entire experiment) would be the same regardless of design formula, and the SampleID factor would be used to calculate effect size and significance. But clearly the design formula is having a substantial impact on global dispersion estimates. Any resource that I could read to help me understand why this is the case would be greatly appreciated.

3) Am I wrong to interpret the 2F sample pairing model as under-performing here? As the main goal of this analysis is to identify DEGs between sorted and unsorted samples (mostly reflected in PC1 axis), it seems to me that the 1F model is more appropriate based on the p value histograms. It also yields significantly more DEGs. But is it possible that this 1F model is actually underestimating the dispersion and overestimating the significance in comparison to the 2F model?

4) In other datasets with more replicate pairs, I have seen that the sample pairing factor has helped some analyses. Is the presumed penalty in this case due to the low number of replicates? I am wondering if there is some analytical way to determine which data sets will be better suited to which model. It isn't very time consuming to run these analyses, but running all possible models and using the one that yields the most DEGs doesn't seem right.

Thanks for your help and input!

deseq2 • 691 views

ADD COMMENT • link updated 5.8 years ago by Michael Love 43k • written 5.8 years ago by SeqGoblin3 • 0

score 0 · Answer 1 · 2020-01-06

Just a quick note, for the p-value histograms, the spikes are just from low count genes, remove these first before performing QC on the p-value histogram.

"This becomes quite dramatic in the adjusted p values histogram" => the adjusted p-value distribution is not expected to be uniform.

Adding in the sample ID is helpful for the paired samples, but then when you also include WT, you really take a hit in the number of degrees of freedom, because you now have a coefficient for each WT sample, so those WT samples are no longer helping you estimate the dispersion. You then have 9 samples, but 7 coefficients to estimate, so still only 2 degrees of freedom.

I'd recommend, for comparing the sorted vs unsorted while controlling for sample, to just use the 6 samples that are involved, and ~sample + fraction as the design.

DESeq2 uses the design to estimate the dispersion, see the 2014 paper for details.

It's possible that the one factor model, making use of the WT samples as well, outperforms the two factor model, but I'm not sure it's a fair comparison because you're really interested in the sorted vs unsorted for mutant, right? Did you compare 1F and 2F on the mutant samples alone?