After reading more about how dispersion shrinkage works, I want to check my understanding and ask a few questions. I have read both the DESeq2 and DDS papers that discuss dispersion shrinkage.
The point of dispersion shrinkage is to share information (dispersion estimates) across genes with similar average expression levels. The assumption is that genes with similar expression levels should have similar dispersions, so once a curve has been fitted that represents the expected dispersion at each expression level, each gene's estimate is shrunk toward that curve. The larger the dispersion value, the larger the difference in expression has to be for a gene to be called DE. As the number of replicates per condition increases, the amount of shrinkage per gene decreases, because the dispersion can then be estimated reliably from the gene's own data.
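To make my mental model concrete, here is a minimal sketch in Python of how I picture the shrinkage. It is not DESeq2's actual code: DESeq2 computes a maximum a posteriori estimate, whereas this sketch swaps in a simpler normal-normal approximation on the log scale, so the shrunken value is just a weighted average of the gene-wise estimate and the fitted trend. The function names and the `prior_var` value are hypothetical. The one piece taken from my reading of the DESeq2 paper is that the sampling variance of a log dispersion estimate is approximated by trigamma((m - p) / 2), where m is the number of samples and p the number of model coefficients.

```python
import math


def trigamma(x):
    """Trigamma function via recurrence plus an asymptotic series (stdlib only)."""
    result = 0.0
    # Shift x upward using trigamma(x) = trigamma(x + 1) + 1 / x**2
    while x < 6.0:
        result += 1.0 / (x * x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    # Asymptotic expansion: 1/x + 1/(2x^2) + 1/(6x^3) - 1/(30x^5) + 1/(42x^7)
    result += inv + inv2 / 2.0 + inv2 * inv * (1.0 / 6.0 - inv2 * (1.0 / 30.0 - inv2 / 42.0))
    return result


def shrink_log_dispersion(genewise, trend, n_samples, n_coefs, prior_var):
    """Hypothetical helper: shrink a gene-wise dispersion toward the trend.

    ASSUMPTION: normal-normal approximation on the log scale, not DESeq2's MAP fit.
    prior_var is the assumed spread of true log dispersions around the trend.
    """
    # Approximate sampling variance of the log dispersion estimate
    sampling_var = trigamma((n_samples - n_coefs) / 2.0)
    # Weight placed on the gene's own estimate; the rest goes to the trend
    weight = prior_var / (prior_var + sampling_var)
    log_shrunk = weight * math.log(genewise) + (1.0 - weight) * math.log(trend)
    return math.exp(log_shrunk), weight


if __name__ == "__main__":
    # Two conditions (n_coefs = 2), 3 versus 6 replicates per condition
    for n_samples in (6, 12):
        shrunk, weight = shrink_log_dispersion(0.5, 0.1, n_samples, 2, prior_var=0.5)
        print(n_samples, round(weight, 3), round(shrunk, 3))
```

In this simplified picture, the weight on the gene's own estimate grows as the residual degrees of freedom (m - p) grow, which is how I understand "less shrinkage with more replicates".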
1) Biologically, how can we make the assumption that genes with similar expression levels have similar dispersions? Namely, what is the evidence for this?
2) In most cases it is recommended to include all samples in the experiment when running DESeq2, so that the dispersion parameter for each gene is estimated more accurately. Is this because the extra samples give DESeq2 more data points with which to fit the dispersion estimate?
3) If 2 is true, then should including more samples not affect the magnitude of shrinkage for each gene's dispersion estimate? Is the shrinkage only affected when more replicates per condition are included?
4) Dispersion estimates are made gene-wise. Is this gene-wise estimate specific to the genes within a group of replicates? For example, with two different conditions (1 and 2), the dispersion estimate for gene A in condition 1 will not necessarily be the same as the dispersion estimate for gene B in condition 2 (especially if they are expressed at different levels).
Any comments and help on understanding this would be greatly appreciated.