14 months ago by
Cambridge, United Kingdom
I'll leave others to answer number 1, but I'll poke my head in for question 2.
Consider the cause of the zero inflation in single-cell RNA-seq data. It's due to a combination of low input RNA, low capture efficiency and strong amplification of captured transcripts to obtain enough cDNA for sequencing. This means that successful capture (and subsequent amplification) of a small number of transcripts in a few cells results in a separate distribution of large counts along with lots of zero counts corresponding to cells in which the transcripts failed to be captured. Indeed, if you get rid of the amplification effects with UMIs, you find that the zero inflation is greatly attenuated - possibly still present due to biological heterogeneity, but that's another story.
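To make the capture-then-amplify argument concrete, here is a toy simulation. Everything in it is an assumption chosen for illustration (5 transcript molecules per cell, 5% capture efficiency, 100 to 1000 reads per captured molecule, and the `simulate_cell` helper itself); it is a sketch of the mechanism, not a model of any real protocol.

```python
import random

def simulate_cell(n_molecules, capture_prob, rng):
    """Toy model of one gene in one cell (all parameters are illustrative)."""
    # Each molecule is captured independently with probability capture_prob.
    captured = sum(1 for _ in range(n_molecules) if rng.random() < capture_prob)
    # Each captured molecule is amplified into 100-1000 reads (assumed range).
    return sum(rng.randint(100, 1000) for _ in range(captured))

rng = random.Random(0)  # fixed seed so the run is reproducible
counts = [simulate_cell(5, 0.05, rng) for _ in range(2000)]

zero_fraction = counts.count(0) / len(counts)
nonzero = [c for c in counts if c > 0]

print("fraction of cells with zero count:", round(zero_fraction, 3))
print("smallest non-zero count:", min(nonzero))
```

Under these assumptions most cells report zero, and the cells that captured anything report counts in the hundreds, with nothing in between. Replacing the amplified read count with the number of captured molecules (as UMIs effectively do) would collapse that gap.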
In bulk RNA-seq, these considerations are not particularly relevant as you have high input quantities of RNA. This results in high-complexity libraries, reducing the chance of sequencing multiple amplicons of the same original cDNA molecule. Bulk populations also have more stable average expression profiles than single cells, so there's less chance of getting one replicate with zero and the others with large non-zero counts. Obviously, you will always get some zeros when your mean is close to zero - this is already handled by count-based models with no need for an extra zero inflation term.
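As a quick check on that last point, the zero probability implied by a negative binomial model can be computed directly. The sketch below assumes the usual parametrization with variance = mean + dispersion * mean^2; the function name `nb_zero_prob` and the dispersion value of 0.1 are illustrative choices, not taken from any package.

```python
import math

def nb_zero_prob(mean, dispersion):
    """P(count == 0) under a negative binomial with
    variance = mean + dispersion * mean**2 (Poisson when dispersion is 0)."""
    if dispersion == 0:
        return math.exp(-mean)
    return (1.0 + dispersion * mean) ** (-1.0 / dispersion)

# Zeros are expected at low means and essentially impossible at high means.
for mean in (0.1, 1.0, 10.0, 100.0):
    print(mean, nb_zero_prob(mean, 0.1))
```

A gene with a mean near zero is expected to produce mostly zeros under the plain count model, so those zeros are not evidence of zero inflation. Only zeros at means where this probability is negligible would call for something extra.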
Any zero inflation would manifest as large dispersion estimates in negative binomial models. I haven't seen that in bulk RNA-seq data - or more specifically, I haven't seen large dispersions that are caused by an excess of zeros, as they tend to be caused by a spread of non-zero counts. When I have seen an excess of zeros, it's usually caused by something biological, e.g., I forgot to block on sex and Xist is now "zero-inflated". Clearly, the better solution in that case would be to block on the offending factor, or to find hidden factors... with RUVSeq.
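The Xist example can be illustrated with a crude method-of-moments dispersion, obtained by solving variance = mean + dispersion * mean^2 for the dispersion. This is only a sketch: the counts are made up, `moment_dispersion` is a hypothetical helper, and real tools estimate dispersions quite differently (with a fitted design matrix and sharing of information across genes).

```python
def moment_dispersion(counts):
    """Crude method-of-moments dispersion: (variance - mean) / mean**2,
    truncated at zero. Illustrative only."""
    n = len(counts)
    mean = sum(counts) / n
    if mean == 0:
        return 0.0  # all zeros: nothing to estimate
    variance = sum((c - mean) ** 2 for c in counts) / (n - 1)
    return max(0.0, (variance - mean) / mean ** 2)

# Made-up counts for an Xist-like gene: expressed in females, absent in males.
female = [80, 120, 95, 130, 75]
male = [0, 0, 0, 0, 0]

pooled_disp = moment_dispersion(female + male)  # sex ignored
female_disp = moment_dispersion(female)         # within one level of the blocking factor
male_disp = moment_dispersion(male)

print("pooled:", pooled_disp)
print("female only:", female_disp)
print("male only:", male_disp)
```

Ignoring sex gives a dispersion above 1, which looks like zero inflation. Within each sex the dispersion is small, so the apparent excess of zeros is really a missing term in the design.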
modified 14 months ago
14 months ago by
Aaron Lun • 21k