Say I have three samples with a simple design formula ~condition:
sampleA - control
sampleB - treatment (replicate 1)
sampleC - treatment (replicate 2)
DESeq2 returns a single fold change for control vs. treatment, but one could also consider the two comparisons separately: control vs. treatment rep 1 and control vs. treatment rep 2.
For the NB GLM for, say, geneX, there are read counts Kij for three samples (A, B, C).
Of course, if geneX had extremely low counts in replicate 1 but not in replicate 2, we would expect high variance in the LFC estimate from replicate 1, so the two replicates shouldn't be weighted equally.
It's not clear to me how DESeq2 reports one value for this comparison. How are replicates 1 and 2 combined? Would the same explanation apply if we extended this example to, say, 5 control samples and 5 treatment samples?
A related question: which samples are used for the dispersion estimates? I read that the design formula is used to estimate the dispersions, which I don't entirely understand. I did see in the paper that the dispersion shrinkage decreases as the sample size increases. How else does the design formula affect the dispersion estimation?
"In GLM, observations with more variance (e.g. low counts in NB GLM) contribute less to the coefficient estimate."
Intuitively this makes sense, but I don't understand how it works in practice. The theory section of the manual states "counts Kij for gene i, sample j are modeled using a negative binomial distribution with fitted mean μij and a gene-specific dispersion parameter αi," so am I wrong to think there are separate NB GLMs for samples B and C in the above example?
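To make the quoted statement concrete, here is a minimal sketch of the weight each observation receives in a standard iteratively reweighted least squares (IRLS) fit of an NB GLM with log link. The function name `nb_irls_weight` is made up for illustration, and this is plain GLM theory rather than DESeq2's own code.

```python
def nb_irls_weight(mu, alpha):
    """IRLS weight for one observation in an NB GLM with log link.

    With Var(K) = mu + alpha * mu^2 and d(mu)/d(eta) = mu, the weight is
    (d mu / d eta)^2 / Var(K) = mu / (1 + alpha * mu).
    """
    return mu / (1.0 + alpha * mu)

alpha = 0.25  # dispersion value taken from the example in this thread
weights = {mu: nb_irls_weight(mu, alpha) for mu in (5.0, 50.0, 500.0)}
for mu, w in weights.items():
    print(f"fitted mean {mu:6.1f} -> weight {w:.3f}")
```

With a dispersion of 0.25 this gives weights of about 2.222, 3.704, and 3.968 for fitted means of 5, 50, and 500. The weight grows with the fitted mean and levels off near 1/alpha = 4, so an observation whose fitted mean is low (for example because of a small size factor) pulls less on the coefficient.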
Would you mind clarifying with a simplified example in which samples A, B, and C all have size factors of 1.0, the dispersion for geneX is 0.25, and the counts for geneX are 500 in sample A (control), 10 in sample B (rep 1), and 100 in sample C (rep 2)?
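The numbers in this example can be worked through with a small sketch. It fits one NB GLM for geneX across all three samples (not one model per replicate) using textbook IRLS with the dispersion held fixed at 0.25. The helper `fit_nb_glm` and the design matrix layout (intercept plus a treatment indicator) are assumptions for illustration; this is not DESeq2's implementation, which may differ in details such as any shrinkage of fold changes.

```python
import numpy as np

def fit_nb_glm(counts, design, size_factors, alpha, n_iter=50):
    """Fit an NB GLM with log link by IRLS, dispersion alpha held fixed."""
    y = np.asarray(counts, dtype=float)
    X = np.asarray(design, dtype=float)
    s = np.asarray(size_factors, dtype=float)
    # Starting values: least squares on log of normalized counts.
    beta = np.linalg.lstsq(X, np.log((y + 0.5) / s), rcond=None)[0]
    for _ in range(n_iter):
        mu = s * np.exp(X @ beta)        # fitted means
        w = mu / (1.0 + alpha * mu)      # IRLS weights
        z = X @ beta + (y - mu) / mu     # working response
        XtW = X.T * w
        beta = np.linalg.solve(XtW @ X, XtW @ z)
    mu = s * np.exp(X @ beta)
    return beta, mu, mu / (1.0 + alpha * mu)

counts = [500, 10, 100]               # samples A, B, C
design = [[1, 0], [1, 1], [1, 1]]     # columns: intercept, treatment
beta, mu, w = fit_nb_glm(counts, design, [1.0, 1.0, 1.0], alpha=0.25)

lfc = beta[1] / np.log(2)             # natural log -> log2 fold change
print("fitted means:", np.round(mu, 2))
print("weights:", np.round(w, 3))
print("log2 fold change (treatment vs control):", round(lfc, 3))
```

Under these assumptions samples B and C share one fitted mean, (10 + 100) / 2 = 55, and receive equal weights because their size factors are equal. The unpenalized estimate is therefore log2(55 / 500), about -3.18. The replicates would be weighted unequally only if their fitted means differed, for example through different size factors.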
For more detail on how it works, I'd recommend reading a GLM reference; there are many publicly available resources online, as well as books on the topic.