Question

How does the design formula affect dispersion estimate in DESeq2?

0

Entering edit mode

changxu.fan ▴ 20

@changxufan-24128

Last seen 12 months ago

United States

I read the code of DESeq2 roughly and realized that for each gene, one dispersion estimate is used across all samples. My work is to compare the gene expression profiles of mice from 2 different genetic backgrounds (WT and KO). Certain genes are expected to be highly expressed in one genotype but not expressed in the other. I'm curious about how DESeq2 would estimate 1 dispersion value that would fit both genotypes.

I read from the DESeq2 vignette that my design formula will affect the way dispersion is calculated, but no details were given. I couldn't find details regarding how DESeq2 handles this from the reference manual either. I'm not very familiar with bayesian models so I decided to reach out here instead of digging through the code.

The type of sample I receive looks like this:

df <- data.frame(sample = paste0("s", 1:6),
                 genotype = c("WT", "KO", "WT", "KO","WT", "WT" ),
                 batch = paste0("batch", c(1,1,2,2,3,3)))

> df
  sample genotype  batch
1     s1       WT batch1
2     s2       KO batch1
3     s3       WT batch2
4     s4       KO batch2
5     s5       WT batch3
6     s6       WT batch3

The current "flaw" is that in the batch3 both are WT, because the experiment is not done yet and I'll have more batches coming.

To construct a generalized linear model, it seems reasonable to have the design formula like this: ~ batch + genotype. From my understanding of glm, the order of batch and genotype in the design formula shouldn't matter. However, DESeq2 vignette suggests putting the most important variable in the last place. I assume this is for the purpose of dispersion estimation?

Also I think with the design formula ~batch + genotype, batch3 would be useless in the estimation of differential expression between genotypes since the differences are regressed out in the "batch" part, is that correct?

I would highly appreciate any input on this! Thanks~

deseq2 • 1.8k views

ADD COMMENT • link updated 2.7 years ago by Michael Love 41k • written 3.6 years ago by changxu.fan ▴ 20

0

Entering edit mode

“the vignette suggests putting the most important variable in the last place”

See the section, “Note on factor levels”. It only matters in that, when you call results(dds) with no other argument, the software has to guess which coefficient you want a results table for. It uses by default (only when no other information is provided) the last variable in the design, which is typical for R and Bioc packages.

ADD REPLY • link 3.6 years ago Michael Love 41k

0

Entering edit mode

changxu.fan ▴ 20

@changxufan-24128

Last seen 12 months ago

United States

Michael Love I apologize for opening this old post again... seems dispersion is calculated in a "within-group" manner. However, I'm struggling to understand how "groups" are defined with respect to a specific model matrix.

Say If my experiment has this "paired" design: for each patient, I have 2 data points: pre and post treatment.

df <- data.frame(sample = paste0("s", 1:4),
                 condition = c("pre", "post", "pre", "post" ),
                 patient.id = c(1, 1, 2, 2)) 

> df
  sample genotype patient.id
1     s1      pre          1
2     s2     post          1
3     s3      pre          2
4     s4     post          2

The design being ~patient.id + condition. In this scenario, for the purpose of dispersion estimation, what would be a "group"? would sample s1 and s2 be a group? Or s1 and s3?

Thanks a lot!!

ADD COMMENT • link 2.7 years ago changxu.fan ▴ 20

0

Entering edit mode

So I went ahead and did some tests:

# fix dispersion at a very small value. This way, 
# dispersion should be accurately estimated even with just 4 samples
# and over-estimation of dispersion would mean that something unintended happened...

# positive control:

dds <- makeExampleDESeqDataSet(n=100,m=4)
colData(dds)$batch <- rep(1:2, each = 2)
colData(dds)$condition <- rep(c(1, 2), 2)
cts <- rnbinom(4, mu=rep(c(10, 1000), 2), size=1/0.0001)
dds@design <- ~ condition
mode(cts) <- "integer"
counts(dds)[1,] <- cts
sizeFactors(dds) <- rep(1,4)
dds <- estimateDispersionsGeneEst(dds)
colData(dds)
DataFrame with 4 rows and 3 columns
        condition     batch sizeFactor
        <numeric> <integer>  <numeric>
sample1         1         1          1
sample2         2         1          1
sample3         1         2          1
sample4         2         2          1
> counts(dds) %>% head()
      sample1 sample2 sample3 sample4
gene1      12    1033      10    1042
gene2       4       5      10       5
gene3       0       1       6       8
gene4       0       0       1       0
gene5      58      48      58      30
gene6       7      39      70      36
> mcols(dds)$dispGeneEst[1]
[1] 0.00000001

### dispersion is estimated relatively accurately.

## now change design:
dds <- makeExampleDESeqDataSet(n=100,m=4)
colData(dds)$batch <- rep(1:2, each = 2)
colData(dds)$condition <- rep(c(1, 2), 2)
cts <- rnbinom(4, mu=rep(c(10, 1000), 2), size=1/0.0001)
dds@design <- ~ batch 
# design is changed
mode(cts) <- "integer"
counts(dds)[1,] <- cts
sizeFactors(dds) <- rep(1,4)
dds <- estimateDispersionsGeneEst(dds)
colData(dds)
counts(dds) %>% head()
mcols(dds)$dispGeneEst[1]

> mcols(dds)$dispGeneEst[1]
[1] 3.559438
# dispersion is very much overestimated.

## now testing out the design in my mind:
dds <- makeExampleDESeqDataSet(n=100,m=4)
colData(dds)$batch <- rep(1:2, each = 2)
colData(dds)$condition <- rep(c(1, 2), 2)
cts <- rnbinom(4, mu=rep(c(10, 1000), 2), size=1/0.0001)
dds@design <- ~ batch + condition
# use both variables
mode(cts) <- "integer"
counts(dds)[1,] <- cts
sizeFactors(dds) <- rep(1,4)
dds <- estimateDispersionsGeneEst(dds)
colData(dds)
counts(dds) %>% head()
mcols(dds)$dispGeneEst[1]

> mcols(dds)$dispGeneEst[1]
[1] 0.129209

# run this part of the code again:
> mcols(dds)$dispGeneEst[1]
[1] 0.04580673

# there is some variability. But overall not too bad given the limited number of samples

I struggled to understand what happened when the ~ batch + condition was specified as the formula. After reading DESeq2 code and catching up on some GLM knowledge, this is what I have in mind:

Dispersion measures variability NOT explained by the design matrix. When solving for alpha with ~batch + condition as formula, it's sort of like regressing on both variables. And the variation explained by them will not be a part of dispersion.

Hoping I'm making some sense here.

Best regards!

ADD REPLY • link 2.7 years ago changxu.fan ▴ 20

0

Entering edit mode

Correct, dispersion captures variance above the Poisson, conditioned on the covariates in the design.

ADD REPLY • link 2.7 years ago Michael Love 41k

score 2 · Accepted Answer · 2020-09-12

2

Entering edit mode

Michael Love 41k

@mikelove

Last seen 1 hour ago

United States

“I read from the DESeq2 vignette that my design formula will affect the way dispersion is calculated, but no details were given. I couldn't find details regarding how DESeq2 handles this from the reference manual either. I'm not very familiar with bayesian models so I decided to reach out here instead of digging through the code.”

We spent a lot of time writing the 2014 paper in such a way that it would be readable to a broad audience, also with full detail on how the steps are performed.

Once you’ve looked over the paper and Methods there, feel free to post with further questions.

ADD COMMENT • link 3.6 years ago Michael Love 41k

0

Entering edit mode

Hi Michael, I tried to understand the paper to the best I could. May I ask for some intuition? Say if I have WT and KO mice, for a specific gene, if WT samples have a dispersion value of 0.5, and KO samples have a dispersion value of 50, how would this situation be handled (assume the mean expression is the same between WT and KO)? Will the dispersion estimate end up more like an average? I apologize in advance if this is absurd or if it's somewhere in the paper but I didn't see it.

ADD REPLY • link 3.6 years ago changxu.fan ▴ 20

0

Entering edit mode

Yes, the dispersion would probably end up being in the middle. Note however that then the DESeq2 model would be mis-specified (the model assumes constant gene dispersion across samples) so you would not necessarily trust the p-values.

> dds <- makeExampleDESeqDataSet(n=100,m=100)
> cts <- rnbinom(100, mu=100, size=rep(c(1/.1, 1/.5),each=50))
> mode(cts) <- "integer"
> counts(dds)[1,] <- cts
> sizeFactors(dds) <- rep(1,100)
> dds <- estimateDispersionsGeneEst(dds)
> mcols(dds)$dispGeneEst[1]
[1] 0.2464433

Also, remember that the dispersion gives something like the ratio of the variance to the square of the mean. Saying a gene has constant dispersion does not mean it has constant variance.

> var(cts[1:50])/mean(cts[1:50])^2
[1] 0.1112132
> var(cts[51:100])/mean(cts[51:100])^2
[1] 0.5547675