Search
Question: Bypass dispersion estimation in DESeq2
0
10 months ago by
stevenn.volant0 wrote:

Hello,

I use the DESeq2 package for many projects in which the number of samples is quite small. But, sometimes i have some projects with almost 700 samples and i want to do the same kind of analysis for all the projects.

When the number of samples is huge the estimation of the dispersion is very time consuming. In this specific case, i think that using the estimation providing by the parametric or local regression is not required, the gene-wise dispersion can be used (or something close to this estimation).

So my question is, is there a way to by pass the estimation and directly use the gene-wise dispersion ?

Best

modified 10 months ago by Michael Love19k • written 10 months ago by stevenn.volant0
1
10 months ago by
Michael Love19k
United States
Michael Love19k wrote:

hi Stevenn,

Yes, you are right that the MAP will be nearly the same as the MLE with hundreds of residual degrees of freedom. To use the MLE instead of calculating the MAP you can use:

dds <- estimateDispersionsGeneEst(dds)
dispersions(dds) <- mcols(dds)\$dispGeneEst
dds <- nbinomWaldTest(dds) # or nbinomLRT()

But also, aside from the above, you can use some simple wrappers to parallelize many steps in DESeq2. This is what occurs within DESeq() when you use parallel=TRUE, but I'm trying to clean up the function now so it's easier for users to copy the code and do it on their own. There may be some better techniques for memory management which I'm actively exploring.

The above steps can be distributed to any number of cores, e.g.,

idx <- factor(sort(rep(seq_len(nworkers),length=nrow(dds))))
dds.list <- lapply(levels(idx), function(l) dds[idx==l,])
dds.list <- bplapply(dds.list, function(d) {
# some code to run on 'd'
# a function that should NOT go here is estimateDispersionsFit()
# ...that function needs to see all of the rows of 'dds' together
})
dds <- do.call(rbind, dds.list)

If this is becoming a frequent enough use case, could the above (first code chunk) be exposed as an option in the DESeq() wrapper?

That's a reasonable suggestion. DESeq2 then reduces to MLE dispersion with the GLM, which you could just as well get with e.g. glm.nb() with log of size factors as offset, but on the other hand you then have an object that can be input to lfcShrink() to get moderated fold changes, which is novel.