Question

Zero inflation in RUV normalization and DE for single cell data

Nik Tuzov ▴ 90

Some single cell packages (ZIFA, CIDR) make a point of addressing the zero inflation in single cell data. Since RUV is also used for single cell, I have two questions:

1) Would it be beneficial (at least, in theory) to add zero inflated models to RUVseq and RUVnormalize?

2) This is a more theoretical question. I think that bulk RNA data are also zero inflated, though less so than single cell data, so zero-inflated models could in principle have been applied to bulk RNA as well. However, since the sample size for bulk RNA is small, there was simply no way to fit the extra parameters. With single cell data the number of observations is much larger, and that's the main reason why it became meaningful to consider the issue of zero inflation. Did I get that right?

Regards,

Nik Tuzov

ruv ruvseq ruvnormalize single-cell zero-inflated • 3.5k views

ADD COMMENT • link updated 8.3 years ago by davide risso ▴ 980 • written 8.3 years ago by Nik Tuzov ▴ 90

score 1 · Answer 1 · 2017-08-09

Aaron Lun ★ 29k

I'll leave others to answer number 1, but I'll poke my head in for question 2.

Consider the cause of the zero inflation in single-cell RNA-seq data. It's due to a combination of low input RNA, low capture efficiency and strong amplification of captured transcripts to obtain enough cDNA for sequencing. This means that successful capture (and subsequent amplification) of a small number of transcripts in a few cells results in a separate distribution of large counts along with lots of zero counts corresponding to cells in which the transcripts failed to be captured. Indeed, if you get rid of the amplification effects with UMIs, you find that the zero inflation is greatly attenuated - possibly still present due to biological heterogeneity, but that's another story.

In bulk RNA-seq, these considerations are not particularly relevant as you have high input quantities of RNA. This results in high-complexity libraries, reducing the chance of sequencing multiple amplicons of the same original cDNA molecule. Bulk populations also have more stable average expression profiles than single cells, so there's less chance of getting one replicate with zero and the others with large non-zero counts. Obviously, you will always get some zeros when your mean is close to zero - this is already handled by count-based models with no need for an extra zero inflation term.

Any zero inflation would manifest as large dispersion estimates in negative binomial models. I haven't seen that in bulk RNA-seq data - or specifically, I haven't seen that in a way that is caused by an excess of zeros (as large dispersions tend to be caused by a spread of non-zero counts); or when I have seen it, it's usually caused by something biological, e.g., I forgot to block on sex and Xist is now "zero-inflated". Clearly, the better solution in the latter case would be to block on that factor, or find hidden factors... with RUVseq.
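To make the link between zero inflation and large dispersion estimates concrete, here is a short derivation. The notation is an assumption and does not come from the thread: μ is the NB mean, φ the NB dispersion (so the NB variance is μ + φμ²), and π is the probability of an extra zero in a zero-inflated NB (ZINB).

\[
P_{\mathrm{NB}}(Y=0) = (1+\phi\mu)^{-1/\phi},
\qquad
P_{\mathrm{ZINB}}(Y=0) = \pi + (1-\pi)(1+\phi\mu)^{-1/\phi}
\]

\[
E[Y] = m = (1-\pi)\mu,
\qquad
\mathrm{Var}(Y) = (1-\pi)\mu\,\bigl[1 + \mu(\phi+\pi)\bigr] = m + \frac{\phi+\pi}{1-\pi}\,m^{2}
\]

Under these assumptions, a plain NB fitted to ZINB data with marginal mean m would see an apparent dispersion of (φ + π)/(1 − π), which exceeds φ whenever π > 0.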

ADD COMMENT • link 8.3 years ago Aaron Lun ★ 29k

Thanks for replying. One way to rephrase 2) is that when the number of observations is low (typical for bulk RNA), it's hard to say whether zeros are due to zero inflation or just to an NB/Poisson distribution having a low mean. That would be one good reason why you haven't seen compelling evidence in favor of zero-inflated models. I believe that if it were possible to increase the sample size in bulk RNA to the same level as in scRNA, zero-inflated models would quickly gain popularity in bulk RNA studies.

ADD REPLY • link 8.3 years ago Nik Tuzov ▴ 90

I'm not sure that sample size has much to do with this. The low amount of RNA and amplification bias seem more relevant. The first scRNA-seq dataset that I worked with had 10 cells, and there was no doubt a large amount of zero inflation.

Anyway, re: zero-inflation and bulk RNA-seq, this is an interesting paper:

https://academic.oup.com/biostatistics/article/14/1/113/250560/Bayesian-analysis-of-RNA-sequencing-data-by?keytype=ref&ijkey=GDoEiRTJTP8Ed3o

ADD REPLY • link 8.3 years ago davide risso ▴ 980

What tool did you use to measure zero inflation?

ADD REPLY • link 8.3 years ago Nik Tuzov ▴ 90

We did a few goodness-of-fit plots, and it seemed that the negative binomial model was underestimating the number of zeros.
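For readers who want to try a check of this kind, below is a minimal Python sketch. It is not the analysis we ran. It assumes a genes x cells matrix of raw counts, estimates a per-gene NB dispersion by the method of moments, and compares the observed fraction of zeros with the fraction the fitted NB would predict. The function names (`nb_zero_prob`, `moment_dispersion`, `zero_excess`) are hypothetical, and the sketch ignores library-size differences between cells.

```python
import numpy as np


def nb_zero_prob(mu, phi):
    """P(Y = 0) for a negative binomial with mean mu and dispersion phi
    (variance = mu + phi * mu**2). phi = 0 gives the Poisson limit."""
    mu = np.asarray(mu, dtype=float)
    phi = np.asarray(phi, dtype=float)
    safe_phi = np.where(phi > 0, phi, 1.0)  # avoid dividing by zero below
    nb = (1.0 + safe_phi * mu) ** (-1.0 / safe_phi)
    return np.where(phi > 0, nb, np.exp(-mu))


def moment_dispersion(counts):
    """Per-gene method-of-moments dispersion, floored at zero.
    counts: genes x cells array of raw counts."""
    counts = np.asarray(counts, dtype=float)
    mu = counts.mean(axis=1)
    var = counts.var(axis=1, ddof=1)
    with np.errstate(divide="ignore", invalid="ignore"):
        phi = np.maximum((var - mu) / mu**2, 0.0)
    return np.where(mu > 0, phi, 0.0)


def zero_excess(counts):
    """Observed minus NB-expected fraction of zeros, one value per gene.
    Positive values mean the fitted NB underestimates the zeros."""
    counts = np.asarray(counts, dtype=float)
    mu = counts.mean(axis=1)
    phi = moment_dispersion(counts)
    observed = (counts == 0).mean(axis=1)
    return observed - nb_zero_prob(mu, phi)
```

Plotting `zero_excess(counts)` against the per-gene mean gives a rough goodness-of-fit picture. Note that extra zeros also inflate the moment-based dispersion, so the fitted NB absorbs part of the excess and this comparison is, if anything, conservative.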

ADD REPLY • link 8.3 years ago davide risso ▴ 980

It must be very hard to get compelling evidence in favor of any particular distribution with just 10 observations. In the zero-inflated case, the only scenario I can think of is having 9 zeros out of 10, with the 10th observation being very large.

ADD REPLY • link 8.3 years ago Nik Tuzov ▴ 90

Well, it's not quite like having 10 observations, since you have data for ~10,000 genes and you can take advantage of the mean-variance relation expected from the negative binomial distribution to look at goodness-of-fit. Anyway, the data are public, so you can play around with them yourself: https://www.ncbi.nlm.nih.gov/pubmed/24299736

It has a nice pool/split experiment that can be used to tell apart biological and technical variation.

ADD REPLY • link 8.3 years ago davide risso ▴ 980

Would a zero-inflated model fit better? Maybe, maybe not. Would it quickly gain popularity? This discussion is somewhat academic, but I doubt it. Current experimental designs for RNA-seq with low (3-5) numbers of replicates are very cost-effective for their intended purpose: to screen for interesting candidate genes for further functional studies. I would have a tough time convincing collaborators to generate 20-50 replicates for a bulk RNA-seq experiment, just so I could model zero inflation. (This wouldn't just be 20 separate library preps; it would be 20 separate cell cultures/mice/treatments/etc., which is probably the most expensive part of the process nowadays.) Scientifically speaking, the money is better spent elsewhere.

ADD REPLY • link 8.3 years ago Aaron Lun ★ 29k

score 1 · Answer 2 · 2017-08-09

I'll try to answer the first question.

I think it is beneficial to consider zero-inflated distributions to model single-cell data. In fact, we developed the zinbwave package, which implements a zero-inflated negative binomial model. However, we generally use that model for a different situation, in which the unknown is the factor of interest (i.e., the "wanted" variation). This is because, in our experience, the first pass in a single-cell data analysis is most of the time to find the low-dimensional signal of interest in the data (e.g., clustering or pseudotime ordering).

It is technically possible to estimate the factors of unwanted variation with the zinbwave model in a similar way to what RUVSeq does, e.g., by fitting the model only on the negative control genes: the resulting W would have the same interpretation as RUV's W.
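To illustrate what "fitting only on the negative control genes" means, here is a simplified Python sketch in the spirit of the control-gene approach: W is taken from an SVD of log counts restricted to the control genes and then regressed out of every gene. This is not the zinbwave model (there is no zero-inflation term), and it is not the RUVSeq code. The function name `control_gene_factors` and its arguments are hypothetical.

```python
import numpy as np


def control_gene_factors(counts, control_idx, k=1):
    """Estimate k factors of unwanted variation from negative control genes.

    counts      : genes x samples array of raw counts
    control_idx : indices of genes assumed to carry no wanted variation
    Returns W (samples x k) and the corrected log counts (genes x samples).
    """
    log_counts = np.log1p(np.asarray(counts, dtype=float))
    # Center each gene across samples so the factors capture variation only.
    centered = log_counts - log_counts.mean(axis=1, keepdims=True)
    # SVD of the samples x control-genes matrix; the leading left singular
    # vectors are the estimated unwanted factors W.
    u, _, _ = np.linalg.svd(centered[control_idx].T, full_matrices=False)
    w = u[:, :k]
    # Regress every gene on W (least squares) and subtract the fitted part.
    alpha, *_ = np.linalg.lstsq(w, centered.T, rcond=None)  # k x genes
    corrected = log_counts - (w @ alpha).T
    return w, corrected
```

A zero-inflated version would replace the log-scale SVD with a ZINB fit on the same control genes, with W playing the same role.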

For unsupervised problems, however, the RUVSeq model doesn't work well, because it may be risky to naively remove unwanted variation from the data without controlling for the signal of interest. RUVnormalize and the CRAN package ruv implement more sophisticated variations of RUV that work well for unsupervised problems but are harder to generalize to a zero-inflated model.

I hope this answered your question, although I realize that it is not a very satisfactory answer.