Question

How is DESeq2 handling replicates?

XTR5 ▴ 10

@p1000

United States

Say I have three samples with a simple design formula ~condition:

sampleA - control

sampleB - treatment (replicate 1)

sampleC - treatment (replicate 2)

DESeq2 returns a single fold change for control vs. treatment, but one could also consider control vs. treatment rep 1 and control vs. treatment rep 2 separately.

For the NB GLM for, say, geneX, there are three observations: the read counts Kij for samples A, B, and C.

Of course, if geneX had extremely low counts in replicate 1 but not in replicate 2, we would expect a high variance for the LFC estimate from replicate 1, so the two replicates shouldn't be weighted equally.

It's not clear to me how DESeq2 reports one value for this comparison. How are replicates 1 and 2 combined? Would the same explanation apply if we extended this example to, say, 5 control samples and 5 treatment samples?

A related question: which samples are used for the dispersion estimates? I read that the design formula is used to estimate the dispersions, which I don't entirely understand. I did see in the paper that the dispersion shrinkage decreases as the sample size increases. How else does the design formula affect dispersion estimation?

DESeq2 • 1.2k views

updated 2.8 years ago by Michael Love 41k • written 2.8 years ago by XTR5 ▴ 10

score 2 · Accepted Answer · 2021-06-30

Michael Love 41k

@mikelove

United States

In a GLM, observations with more variance (e.g. low counts in an NB GLM) contribute less to the coefficient estimate.
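To make the weighting concrete, here is a minimal sketch in plain Python (not DESeq2 code) of the working weight an observation receives in iteratively reweighted least squares (IRLS) for an NB GLM with a log link. It assumes the NB variance mu + alpha * mu^2; the function names are illustrative only. Note that the weight is a function of the fitted mean, not of the raw count.

```python
def nb_variance(mu, alpha):
    """Variance of a negative binomial with mean mu and dispersion alpha."""
    return mu + alpha * mu ** 2


def irls_weight(mu, alpha):
    """Working weight on the log scale for a log-link NB GLM.

    Derived as (d mu / d eta)^2 / Var = mu^2 / (mu + alpha * mu^2),
    which simplifies to 1 / (1 / mu + alpha).
    """
    return 1.0 / (1.0 / mu + alpha)


alpha = 0.25  # dispersion value taken from the example in this thread
for mu in (1, 10, 100, 1000):
    # Weights grow with the fitted mean and level off near 1 / alpha.
    print(mu, round(irls_weight(mu, alpha), 3))
```

With a fixed dispersion, the weight rises with the fitted mean and saturates at 1 / alpha, so observations with low expected counts carry less information about the log-scale coefficients.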

All samples are used to estimate dispersion (this has been asked a few times before on the support site; searching for similar questions may help).
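As a rough illustration of how the design enters the dispersion estimate, the sketch below fits one mean per condition (which is what a ~condition design gives when all size factors are 1) and then picks the dispersion that maximizes the NB likelihood summed over all samples. This is a simplification under stated assumptions: it is plain maximum likelihood on a grid, whereas DESeq2 uses a Cox-Reid adjusted likelihood and then shrinks gene-wise estimates toward a fitted trend. The counts and function names are hypothetical.

```python
import math


def nb_logpmf(k, mu, alpha):
    """Log probability of count k under NB with mean mu (> 0), dispersion alpha."""
    r = 1.0 / alpha
    return (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
            + r * math.log(r / (r + mu)) + k * math.log(mu / (r + mu)))


def group_means(counts, groups):
    """Fitted means for a one-factor design with all size factors equal to 1."""
    by_group = {}
    for k, g in zip(counts, groups):
        by_group.setdefault(g, []).append(k)
    return {g: sum(v) / len(v) for g, v in by_group.items()}


def ml_dispersion(counts, groups, grid):
    """Grid-search the dispersion maximizing the likelihood over ALL samples."""
    means = group_means(counts, groups)

    def loglik(alpha):
        return sum(nb_logpmf(k, means[g], alpha)
                   for k, g in zip(counts, groups))

    return max(grid, key=loglik)


# Hypothetical counts for one gene: 5 control and 5 treatment samples.
counts = [480, 510, 530, 495, 520, 10, 100, 60, 45, 80]
groups = ["control"] * 5 + ["treatment"] * 5
grid = [10 ** (e / 20.0) for e in range(-60, 21)]  # 0.001 to 10, log-spaced
print(ml_dispersion(counts, groups, grid))
```

Every sample contributes a term to the likelihood, but each is measured against the mean that the design assigns to it. A condition with a single sample is fitted exactly by its own count, so in this sketch the information about spread comes from the replicated condition.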

2.8 years ago • Michael Love 41k

Dear Michael,

"In GLM, observations with more variance (e.g. low counts in NB GLM) contribute less to the coefficient estimate."

Intuitively this makes sense, but I don't understand how this works in practice. The theory section of the manual states "counts Kij for gene i, sample j are modeled using a negative binomial distribution with fitted mean μij and a gene-specific dispersion parameter αi," so am I wrong to think there are separate NB GLMs for samples B and C in the above example?

Would you mind clarifying with a simplified example in which samples A, B, and C all have size factors of 1.0, the dispersion for geneX is 0.25, and the counts for geneX are 500 in sample A (control), 10 in sample B (rep1), and 100 in sample C (rep2)?

2.8 years ago • XTR5 ▴ 10

"Intuitively this makes sense, but I don't understand how this works in practice."

For more detail on how it works, I'd recommend reading a GLM reference; there are many publicly available resources online as well as books on the topic.
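For the simplified example from this thread (counts 500, 10, 100; size factors 1.0; dispersion 0.25), a minimal IRLS sketch in plain Python shows how a single NB GLM per gene combines the replicates. This is not DESeq2's implementation: the function name is hypothetical, the dispersion is taken as given rather than estimated, and no shrinkage is applied.

```python
import math


def fit_nb_glm_two_group(counts, is_treated, size_factors, alpha, n_iter=50):
    """IRLS for log(mu_j / s_j) = b0 + b1 * x_j with a fixed dispersion.

    Assumes both groups are present and all fitted means stay positive.
    Returns (b0, b1) on the natural-log scale.
    """
    norm = [k / s for k, s in zip(counts, size_factors)]
    b0, b1 = math.log(sum(norm) / len(norm)), 0.0
    for _ in range(n_iter):
        s_w = s_wx = s_wxx = s_wz = s_wxz = 0.0
        for k, x, s in zip(counts, is_treated, size_factors):
            eta = b0 + b1 * x
            mu = s * math.exp(eta)
            w = 1.0 / (1.0 / mu + alpha)   # working weight
            z = eta + (k - mu) / mu        # working response
            s_w += w
            s_wx += w * x
            s_wxx += w * x * x
            s_wz += w * z
            s_wxz += w * x * z
        # Solve the 2x2 weighted normal equations (X'WX) b = X'Wz.
        det = s_w * s_wxx - s_wx * s_wx
        b0 = (s_wxx * s_wz - s_wx * s_wxz) / det
        b1 = (s_w * s_wxz - s_wx * s_wz) / det
    return b0, b1


b0, b1 = fit_nb_glm_two_group([500, 10, 100], [0, 1, 1], [1.0, 1.0, 1.0], 0.25)
lfc_glm = b1 / math.log(2)                                  # convert to log2
lfc_avg = (math.log2(10 / 500) + math.log2(100 / 500)) / 2  # naive average
print(round(math.exp(b0 + b1), 3))  # fitted treatment mean: 55.0
print(round(lfc_glm, 3))            # -3.184
print(round(lfc_avg, 3))            # -3.983
```

There is one model per gene, not one per sample: samples B and C share a single fitted treatment mean. With equal size factors that mean is the arithmetic mean of the two counts (55), so the high-count replicate dominates on the count scale, and the resulting log2 fold change (about -3.18) differs from the average of the two per-replicate fold changes (about -3.98). When size factors differ, the fitted means and hence the working weights differ between replicates.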

2.8 years ago • Michael Love 41k