Question

How to get the dispersion values before fitting?

raya.fai ▴ 60

@rayafai-9396

Israel

Hi,

I am interested in finding genes with large dispersion values in the same condition (all the samples are biological replicates of the same condition) and I do not want to make the assumption that genes with similar expression levels have similar dispersion values. This is why I am interested in getting the dispersion values before fitting/shrinking towards the curve.

I see there is a column named dispGeneEst in mcols(dds), and I have several questions:

Are the values in the dispGeneEst column the dispersion values before fitting?
What does it mean if a gene has the maximum dispersion value of 10 or the minimum value of 1.00E-08?
Is it correct to use the dispGeneEst values in my case?
Is the dispersion value of a gene reliable when it is based on three biological replicates, or do I need more replicates?
If I want to run DESeq2 without comparing two conditions, just for getting the normalized counts and the dispersion values, is it enough to specify design=~ 1 in the DESeqDataSetFromMatrix function?

Thank you very much.

All the best,

Raya

score 1 · Answer 1 · 2021-02-07

Some pointers that may help:

https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#access-to-all-calculated-values

(explains how to find more information about the columns, see the description metadata column for the rowData)

For the MLE, an estimate at the minimum value (1e-08) means that the data for that gene are essentially consistent with a Poisson distribution: the sample variance happened to be at or below the mean.

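The maximum is not fixed at 10; it relates to the sample size. For non-negative data, the largest possible ratio of variance to squared mean (the squared coefficient of variation) occurs with a single outlier, and it equals the number of samples: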

> x <- c(12345,rep(0,19))
> var(x)/(mean(x)^2)
[1] 20
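To see that this bound tracks the number of samples rather than the size of the outlier, here is a small base-R sketch. The helper name sq_cv_single_outlier is hypothetical and only wraps the calculation shown above; it is not a DESeq2 function.

```r
# Squared coefficient of variation, var(x) / mean(x)^2, for a vector of
# n non-negative values in which a single sample holds all the counts.
# (sq_cv_single_outlier is a hypothetical helper, not part of DESeq2.)
sq_cv_single_outlier <- function(n, value = 12345) {
  x <- c(value, rep(0, n - 1))
  var(x) / mean(x)^2
}

# Same outlier value, different numbers of samples
ratios <- sapply(c(3, 6, 20), sq_cv_single_outlier)
ratios

# The outlier value itself does not matter, only n does
sq_cv_single_outlier(20, value = 7)
```

In each case the ratio equals n, so with three replicates a moment-style dispersion value cannot exceed 3.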

Yes, you can use the individual MLE values if you have a good number of samples. Three is far too few for the MLE dispersion estimate to be reliable; I would think you need at the least more than 10 samples per condition to get a reliable ML estimate of dispersion without the Bayesian formulation.

Alternatively, if you don't want the mean to contribute, you can use fitType="mean" and take dispersions(dds), the final estimates. This forms a single distribution over all genes, regardless of mean value. If you choose this approach, I recommend minimal filtering first, e.g. keeping genes that have a count of 10 or more in X or more samples. This helps to remove the very noisy dispersion estimates on the far left of the dispersion-over-mean plot.
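A rough way to get a feel for how noisy a per-gene estimate is with few replicates is a small simulation. The sketch below is base R only and makes several assumptions: the true dispersion (0.1), the mean count (100), and the number of simulated genes are arbitrary, and it uses a simple moment estimator, (var - mean) / mean^2, as a stand-in rather than DESeq2's actual MLE.

```r
set.seed(1)

true_disp <- 0.1   # assumed true dispersion for every simulated gene
mu        <- 100   # assumed mean count
n_genes   <- 2000  # number of simulated genes

# Moment estimator of the negative binomial dispersion, floored at 0.
# This is a stand-in for illustration, not the estimator DESeq2 uses.
moment_disp <- function(x) max((var(x) - mean(x)) / mean(x)^2, 0)

# Simulate n_genes genes with n_samples replicates each and return
# one dispersion estimate per gene.
simulate_estimates <- function(n_samples) {
  counts <- matrix(rnbinom(n_genes * n_samples, mu = mu, size = 1 / true_disp),
                   nrow = n_genes)
  apply(counts, 1, moment_disp)
}

est3  <- simulate_estimates(3)
est20 <- simulate_estimates(20)

# Compare how widely the estimates scatter around the true value
quantile(est3,  c(0.05, 0.5, 0.95))
quantile(est20, c(0.05, 0.5, 0.95))
```

Under these assumptions the estimates from 3 replicates scatter much more widely than those from 20, and many sit at the floor, which is the behaviour described above for small sample sizes.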

Yes, design = ~1 will estimate the dispersion with all the samples treated as belonging to the same condition.
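Putting these points together, a minimal sketch of the workflow might look like the following. The count matrix is simulated purely for illustration, and the object names (cts, coldata, X and so on) are hypothetical; substitute your own count matrix and choose X to suit your data.

```r
library(DESeq2)

set.seed(1)
# Hypothetical input: simulated counts for 1000 genes x 12 replicates,
# with a different mean and dispersion for each gene (illustration only).
n_genes   <- 1000
n_samples <- 12
gene_mu   <- 2^runif(n_genes, 2, 10)
gene_disp <- 2^runif(n_genes, -5, 0)
cts <- matrix(rnbinom(n_genes * n_samples, mu = gene_mu, size = 1 / gene_disp),
              nrow = n_genes,
              dimnames = list(paste0("gene", seq_len(n_genes)),
                              paste0("s", seq_len(n_samples))))
coldata <- data.frame(sample = colnames(cts), row.names = colnames(cts))

# Intercept-only design: all samples are replicates of one condition
dds <- DESeqDataSetFromMatrix(countData = cts, colData = coldata, design = ~ 1)

# Minimal filtering: a count of 10 or more in at least X samples
X <- 3  # hypothetical threshold
keep <- rowSums(counts(dds) >= 10) >= X
dds <- dds[keep, ]

dds <- estimateSizeFactors(dds)
dds <- estimateDispersions(dds, fitType = "mean")

norm_counts <- counts(dds, normalized = TRUE)  # normalized counts
gene_wise   <- mcols(dds)$dispGeneEst          # gene-wise estimates before fitting
final_disp  <- dispersions(dds)                # final estimates

# Descriptions of the columns stored in mcols(dds)
head(mcols(mcols(dds), use.names = TRUE))
```

Calling estimateSizeFactors and estimateDispersions directly skips the testing step of DESeq(), which is not needed when there is no comparison between conditions.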