Entering edit mode
mt1022
•
0
@mt1022-11725
Last seen 7.0 years ago
I am trying to understand the procedure of DESeq2 by reproducing its results.
Here is the results of a sample dataset:
library(MASS) library(DESeq2) dds <- DESeq2::makeExampleDESeqDataSet(m = 4) dds <- DESeq(dds) head(counts(dds, normalized = TRUE)) # ' sample1 sample2 sample3 sample4 # gene1 47.607534 9 36.943864 29.718701 # gene2 7.934589 3 1.847193 0.000000 # gene3 3.967295 25 6.465176 17.831221 # gene4 111.084247 18 45.256233 12.878104 # gene5 17.852825 0 2.770790 7.924987 # gene6 31.738356 50 82.200096 83.212364' head(results(dds)) # 'log2 fold change (MLE): condition B vs A # Wald test p-value: condition B vs A # DataFrame with 6 rows and 6 columns # baseMean log2FoldChange lfcSE stat pvalue padj # <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> # gene1 30.817525 0.2359275 0.9203109 0.2563563 0.7976757 0.9976169 # gene2 3.195446 -2.5502438 2.4647424 -1.0346898 0.3008138 0.9976169 # gene3 13.315923 -0.2556672 1.2765912 -0.2002734 0.8412668 0.9976169 # gene4 46.804646 -1.1487331 0.9077488 -1.2654747 0.2057012 0.9976169 # gene5 7.137151 -0.7421327 1.8449693 -0.4022467 0.6875025 0.9976169 # gene6 61.787704 1.0171277 0.6298871 1.6147779 0.1063588 0.9976169' head(mcols(dds))[c('dispGeneEst', 'dispFit', 'dispersion')] # 'DataFrame with 6 rows and 3 columns # dispGeneEst dispFit dispersion # <numeric> <numeric> <numeric> # 1 0.50319679 0.3433675 0.3749897 # 2 0.00000001 2.6887259 2.3152674 # 3 0.74123710 0.6999752 0.7091476 # 4 0.85743018 0.2506918 0.3717475 # 5 2.89172970 1.2435861 1.4900292 # 6 0.01634755 0.2073715 0.1726523'
I am trying to fit a GLM manually based on my understanding with the data for the 6th row:
# create predictor and response data frame df <- data.frame( y = c(32, 50, 82, 83), design = c(0, 0, 1, 1) ) # fit a GLM, using negative.binomial link from MASS. # theta seems to be reciprocal of dispersion: summary(glm(y ~ design + 1, data = df, family = negative.binomial(theta = 1/0.1726523))) # 'Call: # glm(formula = y ~ design + 1, family = negative.binomial(theta = 1/0.01634755), # data = df) # # Deviance Residuals: # 1 2 3 4 # -1.14949 1.03663 -0.03598 0.03586 # # Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) 3.7136 0.1553 23.912 0.00174 ** # design 0.6992 0.2024 3.454 0.07455 . # --- # Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 # # (Dispersion parameter for Negative Binomial(61.1712) family taken to be 1.184123) # # Null deviance: 16.7610 on 3 degrees of freedom # Residual deviance: 2.3985 on 2 degrees of freedom # AIC: 32.697 # # Number of Fisher Scoring iterations: 3'
while `log2(exp(0.6992)) = 1.008732` is close to `1.0171277`. The SE `log2(exp(0.2162)) = 0.3119` is quite far from `0.6298871`.
I would like to know what's wrong with my manual approach.
Thanks for any help.
Thanks for the hint. I realized this and used a fixed dispersion estimated by DESeq2 (0.1726523) for the negative binomial link. But the summary function report a different dispersion, which is very close to the dispGeneEst of DESeq2 (1/0.0163 = 61.3 ~ 61.17). Based on this thread (http://r.789695.n4.nabble.com/glm-nb-versus-glm-estimation-of-theta-td893939.html), it seems that `summary` function estimated dispersion itself. The correct way to obtain SE is:
This way, the SE = log2(exp(0.4369)) = 0.6303, which is very close to that reported by DESeq (0.6298871). I assume the trivial differences are due to usage of rounded normalized counts?
Cheers!
Right, DESeq2 uses the raw counts plus an offset to deal with normalization.