Entering edit mode
mt1022
•
0
@mt1022-11725
Last seen 7.6 years ago
I am trying to understand the procedure of DESeq2 by reproducing its results.
Here is the results of a sample dataset:
library(MASS)
library(DESeq2)
dds <- DESeq2::makeExampleDESeqDataSet(m = 4)
dds <- DESeq(dds)
head(counts(dds, normalized = TRUE))
# ' sample1 sample2 sample3 sample4
# gene1 47.607534 9 36.943864 29.718701
# gene2 7.934589 3 1.847193 0.000000
# gene3 3.967295 25 6.465176 17.831221
# gene4 111.084247 18 45.256233 12.878104
# gene5 17.852825 0 2.770790 7.924987
# gene6 31.738356 50 82.200096 83.212364'
head(results(dds))
# 'log2 fold change (MLE): condition B vs A
# Wald test p-value: condition B vs A
# DataFrame with 6 rows and 6 columns
# baseMean log2FoldChange lfcSE stat pvalue padj
# <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
# gene1 30.817525 0.2359275 0.9203109 0.2563563 0.7976757 0.9976169
# gene2 3.195446 -2.5502438 2.4647424 -1.0346898 0.3008138 0.9976169
# gene3 13.315923 -0.2556672 1.2765912 -0.2002734 0.8412668 0.9976169
# gene4 46.804646 -1.1487331 0.9077488 -1.2654747 0.2057012 0.9976169
# gene5 7.137151 -0.7421327 1.8449693 -0.4022467 0.6875025 0.9976169
# gene6 61.787704 1.0171277 0.6298871 1.6147779 0.1063588 0.9976169'
head(mcols(dds))[c('dispGeneEst', 'dispFit', 'dispersion')]
# 'DataFrame with 6 rows and 3 columns
# dispGeneEst dispFit dispersion
# <numeric> <numeric> <numeric>
# 1 0.50319679 0.3433675 0.3749897
# 2 0.00000001 2.6887259 2.3152674
# 3 0.74123710 0.6999752 0.7091476
# 4 0.85743018 0.2506918 0.3717475
# 5 2.89172970 1.2435861 1.4900292
# 6 0.01634755 0.2073715 0.1726523'
I am trying to fit a GLM manually based on my understanding with the data for the 6th row:
# create predictor and response data frame
df <- data.frame(
y = c(32, 50, 82, 83),
design = c(0, 0, 1, 1)
)
# fit a GLM, using negative.binomial link from MASS.
# theta seems to be reciprocal of dispersion:
summary(glm(y ~ design + 1, data = df, family = negative.binomial(theta = 1/0.1726523)))
# 'Call:
# glm(formula = y ~ design + 1, family = negative.binomial(theta = 1/0.01634755),
# data = df)
#
# Deviance Residuals:
# 1 2 3 4
# -1.14949 1.03663 -0.03598 0.03586
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 3.7136 0.1553 23.912 0.00174 **
# design 0.6992 0.2024 3.454 0.07455 .
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# (Dispersion parameter for Negative Binomial(61.1712) family taken to be 1.184123)
#
# Null deviance: 16.7610 on 3 degrees of freedom
# Residual deviance: 2.3985 on 2 degrees of freedom
# AIC: 32.697
#
# Number of Fisher Scoring iterations: 3'
while `log2(exp(0.6992)) = 1.008732` is close to `1.0171277`. The SE `log2(exp(0.2162)) = 0.3119` is quite far from `0.6298871`.
I would like to know what's wrong with my manual approach.
Thanks for any help.

Thanks for the hint. I realized this and used a fixed dispersion estimated by DESeq2 (0.1726523) for the negative binomial link. But the summary function report a different dispersion, which is very close to the dispGeneEst of DESeq2 (1/0.0163 = 61.3 ~ 61.17). Based on this thread (http://r.789695.n4.nabble.com/glm-nb-versus-glm-estimation-of-theta-td893939.html), it seems that `summary` function estimated dispersion itself. The correct way to obtain SE is:
This way, the SE = log2(exp(0.4369)) = 0.6303, which is very close to that reported by DESeq (0.6298871). I assume the trivial differences are due to usage of rounded normalized counts?
Cheers!
Right, DESeq2 uses the raw counts plus an offset to deal with normalization.