Question

Understand how DESeq estimates SE for log2 fold change

0

Entering edit mode

mt1022 • 0

@mt1022-11725

Last seen 7.0 years ago

I am trying to understand the procedure of DESeq2 by reproducing its results.

Here is the results of a sample dataset:

library(MASS)
library(DESeq2)

dds <- DESeq2::makeExampleDESeqDataSet(m = 4)
dds <- DESeq(dds)

head(counts(dds, normalized = TRUE))
# '         sample1 sample2   sample3   sample4
# gene1  47.607534       9 36.943864 29.718701
# gene2   7.934589       3  1.847193  0.000000
# gene3   3.967295      25  6.465176 17.831221
# gene4 111.084247      18 45.256233 12.878104
# gene5  17.852825       0  2.770790  7.924987
# gene6  31.738356      50 82.200096 83.212364'

head(results(dds))
# 'log2 fold change (MLE): condition B vs A 
# Wald test p-value: condition B vs A 
# DataFrame with 6 rows and 6 columns
#        baseMean log2FoldChange     lfcSE       stat    pvalue      padj
#       <numeric>      <numeric> <numeric>  <numeric> <numeric> <numeric>
# gene1 30.817525      0.2359275 0.9203109  0.2563563 0.7976757 0.9976169
# gene2  3.195446     -2.5502438 2.4647424 -1.0346898 0.3008138 0.9976169
# gene3 13.315923     -0.2556672 1.2765912 -0.2002734 0.8412668 0.9976169
# gene4 46.804646     -1.1487331 0.9077488 -1.2654747 0.2057012 0.9976169
# gene5  7.137151     -0.7421327 1.8449693 -0.4022467 0.6875025 0.9976169
# gene6 61.787704      1.0171277 0.6298871  1.6147779 0.1063588 0.9976169'

head(mcols(dds))[c('dispGeneEst',   'dispFit', 'dispersion')]
# 'DataFrame with 6 rows and 3 columns
#   dispGeneEst   dispFit dispersion
#     <numeric> <numeric>  <numeric>
# 1  0.50319679 0.3433675  0.3749897
# 2  0.00000001 2.6887259  2.3152674
# 3  0.74123710 0.6999752  0.7091476
# 4  0.85743018 0.2506918  0.3717475
# 5  2.89172970 1.2435861  1.4900292
# 6  0.01634755 0.2073715  0.1726523'

I am trying to fit a GLM manually based on my understanding with the data for the 6th row:

# create predictor and response data frame
df <- data.frame(
    y = c(32, 50, 82, 83),
    design = c(0, 0, 1, 1)
)

# fit a GLM, using negative.binomial link from MASS.
# theta seems to be reciprocal of dispersion:
summary(glm(y ~ design + 1, data = df, family = negative.binomial(theta = 1/0.1726523)))

# 'Call:
# glm(formula = y ~ design + 1, family = negative.binomial(theta = 1/0.01634755), 
#     data = df)
# 
# Deviance Residuals: 
#        1         2         3         4  
# -1.14949   1.03663  -0.03598   0.03586  
# 
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)   
# (Intercept)   3.7136     0.1553  23.912  0.00174 **
# design        0.6992     0.2024   3.454  0.07455 . 
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# (Dispersion parameter for Negative Binomial(61.1712) family taken to be 1.184123)
# 
#     Null deviance: 16.7610  on 3  degrees of freedom
# Residual deviance:  2.3985  on 2  degrees of freedom
# AIC: 32.697
# 
# Number of Fisher Scoring iterations: 3'

while `log2(exp(0.6992)) = 1.008732` is close to `1.0171277`. The SE `log2(exp(0.2162)) = 0.3119` is quite far from `0.6298871`.
I would like to know what's wrong with my manual approach.

Thanks for any help.

deseq2 • 958 views

ADD COMMENT • link updated 7.0 years ago by Michael Love 43k • written 7.0 years ago by mt1022 • 0

score 1 · Answer 1 · 2018-04-18

1

Entering edit mode

Michael Love 43k

@mikelove

Last seen 11 days ago

United States

The SE depends critically on the dispersion. DESeq2 estimates dispersion by sharing information from many genes, and so this is different than by looking at the data from a single gene.

ADD COMMENT • link 7.0 years ago Michael Love 43k

0

Entering edit mode

Thanks for the hint. I realized this and used a fixed dispersion estimated by DESeq2 (0.1726523) for the negative binomial link. But the summary function report a different dispersion, which is very close to the dispGeneEst of DESeq2 (1/0.0163 = 61.3 ~ 61.17). Based on this thread (http://r.789695.n4.nabble.com/glm-nb-versus-glm-estimation-of-theta-td893939.html), it seems that `summary` function estimated dispersion itself. The correct way to obtain SE is:

summary(glm(y ~ design + 1, data = df, family = negative.binomial(theta = 1/0.1726523)), dispersion = 1)

# Call:
# glm(formula = y ~ design + 1, family = negative.binomial(theta = 1/0.1726523),
#     data = df)
#
# Deviance Residuals:
#        1         2         3         4
# -0.53324   0.46415  -0.01413   0.01407
#
# Coefficients:
#             Estimate Std. Error z value Pr(>|z|)
# (Intercept)   3.7136     0.3139   11.83   <2e-16 ***
# design        0.6992     0.4369    1.60     0.11
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# (Dispersion parameter for Negative Binomial(5.792) family taken to be 1)
#
#     Null deviance: 3.02499  on 3  degrees of freedom
# Residual deviance: 0.50018  on 2  degrees of freedom
# AIC: 37.76
#
# Number of Fisher Scoring iterations: 4

This way, the SE = log2(exp(0.4369)) = 0.6303, which is very close to that reported by DESeq (0.6298871). I assume the trivial differences are due to usage of rounded normalized counts?

Cheers!

ADD REPLY • link 7.0 years ago mt1022 • 0

1

Entering edit mode

Right, DESeq2 uses the raw counts plus an offset to deal with normalization.

ADD REPLY • link 7.0 years ago Michael Love 43k