Question

Can DESeq2 deal with zero-inflated data?

KELVINLEE • 0

@kelvinlee-9111

Last seen 10.2 years ago

Singapore

I have an RNA-seq data set that has many zeros due to insufficient sequencing depth and low abundance of certain genes. I want to use DESeq2 to analyse my data, but I am not sure whether DESeq2 can deal with a zero-inflated data set like mine. I know that DESeq2 uses the negative binomial rather than the zero-inflated negative binomial model. I hope someone can help me out. Thank you.

deseq2 • 5.8k views

ADD COMMENT • link updated 10.3 years ago by Simon Anders ★ 3.8k • written 10.3 years ago by KELVINLEE • 0

score 1 · Answer 1 · 2015-11-05

Have a try with the BioC package tweeDEseq. It uses the Poisson-Tweedie family of count distributions, which allows one to fit odd distributional features such as heavy tails or zero-inflation. You will find more details in the vignette of the package and in the corresponding article:

Esnaola et al. A flexible count data model to fit the wide diversity of expression profiles arising from extensively replicated RNA-seq experiments. BMC Bioinformatics, 14:254, 2013.

cheers,

robert.

score 1 · Answer 2 · 2015-11-05

Simon Anders ★ 3.8k

@simon-anders-3855

Last seen 5.5 years ago

Zentrum für Molekularbiologie, Universi…

Yes, you can use DESeq2 for this, because I doubt that you have "zero-inflated" data.

Note that the term "zero-inflated" does not simply mean that your data has more zeroes than usual RNA-Seq data sets. Rather, it means that the proportion of samples with zero values in the data is larger than what a negative-binomial or similar model would predict given the average counts for the gene across all samples.

Now, Poisson-mixture models of sequencing (such as the negative binomial model used in DESeq2 and similar tools) do predict that the proportion of zero counts increases if sequencing depth is low, so there is no inflation of zeroes compared to the model, i.e., no need for a special zero-inflated null distribution. Hence, if your large number of zeroes is only due to low sequencing depth, then DESeq2 (or any similar tool) should work fine.
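To make this concrete, here is a minimal base-R sketch of how the zero probability of a plain negative binomial grows as depth drops. The gene (mean count 20 at full depth, dispersion 0.5), the depth fractions and the helper name p_zero_nb are made-up illustrations, not values from any real data set or from DESeq2 itself.

```r
# Probability of a zero count under a negative binomial with mean mu and
# dispersion alpha (variance = mu + alpha * mu^2).
# p_zero_nb is a hypothetical helper name used only for this sketch.
p_zero_nb <- function(mu, alpha) {
  dnbinom(0, mu = mu, size = 1 / alpha)
}

# Hypothetical gene: mean count 20 at full depth, dispersion 0.5.
depth_fraction <- c(1, 0.5, 0.1, 0.05, 0.01)
mu <- 20 * depth_fraction  # expected count scales with sequencing depth
p0 <- p_zero_nb(mu, alpha = 0.5)

# The zero probability rises as depth falls, with no zero-inflation component.
print(data.frame(depth_fraction, mean_count = mu, p_zero = round(p0, 3)))
```

The zeros here come entirely from the ordinary NB model; nothing extra is needed to explain them.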

Some authors claim that certain types of data or of experimental design (especially data with strong experimental [not: technical] noise) cause zero inflation and that the negative binomial is then a bad fit. As far as I understand, however, these authors do not claim that low sequencing depth is among the reasons for using a zero-inflated null distribution, because there the conventional models predict the increase in zero counts quite well.

ADD COMMENT • link 10.3 years ago Simon Anders ★ 3.8k

so is there a way I can check whether my data is zero-inflated?

ADD REPLY • link 10.3 years ago KELVINLEE • 0

Hi, there are different approaches to model and test for goodness of fit to a zero-inflated distribution, see for instance here and here. One way to approach this question with tweeDEseq is simply to estimate the shape parameter of the Poisson-Tweedie distribution and check whether it is close to the shape value for the negative binomial (a=0) or something else (not negative binomial):

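library(tweeDEseq)  # provides mlePoissonTweedie() and getParam()

# example vector of 49 counts, most of them zero
y <- c(0,63,1,4,1,44,2,2,1,0,1,0,0,0,0,1,0,0,3,0,0,2,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,6,1,11,1,1,0,0,0,2)
# maximum-likelihood fit of the Poisson-Tweedie distribution
thetahat <- mlePoissonTweedie(y)
getParam(thetahat)
         mu           D           a
  3.0408163 102.5138255   0.5753331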

In this case, the distribution of counts with this many zeroes seems close to a Poisson-inverse Gaussian (see Esnaola et al., 2013, Fig. 4). The vignette of tweeDEseq shows how to run goodness-of-fit tests on every row of a matrix of counts and produce a Q-Q plot, which helps you judge what fraction of genes follow the count distribution you are interested in.

cheers,

robert.

ADD REPLY • link 10.3 years ago Robert Castelo ★ 3.4k

A first diagnostic is to look at the scatterplot of counts between replicates, and check the frequency of having a very large count in one replicate and a zero in another replicate, for the same gene. However, I don't know about a quantitative diagnostic that would then help you objectively decide whether the data are zero-inflated or not. (And see the fortune(234) quote below, which also applies here - i.e. the question is not whether zero-inflation is detectable but whether it's bad enough to distort the inference.)
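As a rough sketch of that replicate check in base R: the two "replicates" below are simulated NB counts standing in for real columns of a count matrix, and the cutoff of 50 for a "very large" count is an arbitrary assumption that should be tuned to the sequencing depth at hand.

```r
set.seed(1)

# Simulated stand-in for two replicates of the same condition:
# 2000 genes, NB counts with gene-specific means and dispersion 0.2.
# With real data, rep1 and rep2 would be two columns of the count matrix.
n_genes <- 2000
gene_mean <- rexp(n_genes, rate = 1 / 50)
rep1 <- rnbinom(n_genes, mu = gene_mean, size = 1 / 0.2)
rep2 <- rnbinom(n_genes, mu = gene_mean, size = 1 / 0.2)

# Genes with a zero in one replicate and a large count in the other.
large <- 50  # arbitrary cutoff for "very large" (an assumption)
discordant <- (rep1 == 0 & rep2 >= large) | (rep2 == 0 & rep1 >= large)
cat("discordant genes:", sum(discordant), "of", n_genes, "\n")

# Scatterplot on a log scale; discordant genes sit along the axes.
plot(log2(rep1 + 1), log2(rep2 + 1), pch = 20, cex = 0.4,
     col = ifelse(discordant, "red", "grey40"),
     xlab = "log2(replicate 1 + 1)", ylab = "log2(replicate 2 + 1)")
```

Many red points far out along an axis would be a hint worth following up; a handful is what ordinary count noise produces anyway.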

Models that explicitly model the data as a mixture of a point mass at zero and another, more dispersed distribution are interesting - but I wonder whether, in those cases where they would apply, the real data doesn't also have an excess of other small numbers (e.g. 1, 2, ...) and how they handle that?

library("fortunes")

fortune(234)

The issue really comes down to the fact that the questions: "exactly normal?", and "normal enough?" are 2 very different questions (with the difference becoming greater with increased sample size) and while the first is the easier to answer, the second is generally the more useful one.
   -- Greg Snow (answering a question about a "normality test" 
      suitable for large data)
      R-help (April 2009)

ADD REPLY • link 10.3 years ago Wolfgang Huber ★ 13k

@Simon, wouldn't zeros from sequencing "errors" or being under threshold or something like that constitute exactly a zero-inflated model? x = ifelse(<zero for sequencing reason>, 0, <real distribution>)
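The mixture in that pseudocode can be simulated in a few lines of base R. This is only a sketch under assumed parameters: the mean, the dispersion and the 20% structural-zero rate are invented for illustration and say nothing about how often such zeros occur in real sequencing data.

```r
set.seed(42)

n <- 10000
mu <- 5         # mean of the "real" NB component (assumed)
alpha <- 0.3    # NB dispersion (assumed)
pi_zero <- 0.2  # probability of a zero "for sequencing reason" (assumed)

# x = ifelse(<zero for sequencing reason>, 0, <real distribution>)
structural_zero <- rbinom(n, size = 1, prob = pi_zero) == 1
real_count <- rnbinom(n, mu = mu, size = 1 / alpha)
x <- ifelse(structural_zero, 0, real_count)

# Theoretical zero probabilities with and without the extra zero component.
p0_nb   <- dnbinom(0, mu = mu, size = 1 / alpha)
p0_zinb <- pi_zero + (1 - pi_zero) * p0_nb

cat("zero fraction, NB component alone (theory):", round(p0_nb, 3), "\n")
cat("zero fraction, mixture (theory):", round(p0_zinb, 3), "\n")
cat("zero fraction, simulated mixture:", round(mean(x == 0), 3), "\n")
```

The gap between the first and the last two numbers is the "inflation". Zeros that arise simply because a low expected count was sampled are already covered by the NB component and add nothing to that gap.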

ADD REPLY • link 6.2 years ago ariel ▴ 20

BTW, this thread is quite old; here are some relevant links from the four years since:

https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1406-4

https://bioconductor.org/packages/release/bioc/vignettes/zinbwave/inst/doc/intro.html#differential-expression-with-deseq2

https://github.com/mikelove/zinbwave-deseq2/blob/master/zinbwave-deseq2.knit.md

The question remains whether a given dataset requires a zero component, but if you do require it, we have built out the infrastructure.

ADD REPLY • link 6.2 years ago Michael Love 43k

score 0 · Answer 3 · 2015-11-05

DESeq2 and edgeR both use only the negative binomial distribution and do not support zero inflation, as far as I know. However, are you sure your data is zero-inflated? The NB distribution allows for a certain amount of zeros in the data on its own. Just having zeros in your data does not make it "zero-inflated". Zero-inflation would mean that the non-zero data follows a NB distribution, but the number of zeros is in excess of what would be predicted from the NB.
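A crude way to put numbers on that definition, sketched in base R: compare the observed fraction of zeros for one gene with the fraction predicted by an NB whose mean and dispersion are matched by the method of moments. The helper name zero_check is hypothetical, the moment fit is only an approximation (a maximum-likelihood fit would be preferable), and with few samples per gene the comparison is very noisy. The example vector is the one posted elsewhere in this thread.

```r
# Compare the observed zero fraction of one count vector with the fraction
# predicted by an NB fitted by the method of moments.
# zero_check is a hypothetical helper name used only for this sketch.
zero_check <- function(counts) {
  mu <- mean(counts)
  stopifnot(mu > 0)  # the moment fit needs at least one non-zero count
  v <- var(counts)
  alpha <- max((v - mu) / mu^2, 0)  # from variance = mu + alpha * mu^2
  expected <- if (alpha > 0) {
    dnbinom(0, mu = mu, size = 1 / alpha)
  } else {
    exp(-mu)  # no overdispersion: Poisson limit
  }
  c(observed = mean(counts == 0), expected_nb = expected)
}

y <- c(0,63,1,4,1,44,2,2,1,0,1,0,0,0,0,1,0,0,3,0,0,2,0,0,0,0,0,2,0,0,0,
       0,0,0,0,0,0,0,0,0,6,1,11,1,1,0,0,0,2)
print(zero_check(y))
```

Only an observed fraction clearly above the expected one would point towards zero inflation, and even then the caveats about the moment fit apply.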

If you really believe you have zero-inflated data, the only package I've heard of for analyzing RNA-seq data using the ZI-NB distribution is ShrinkBayes, but its website seems to be down now.