Question

Pool Single-cells: Analysis

Radek ▴ 90

@radek-8889

Belgium

Hello!

We are thinking of doing RNA-seq on single cells. Due to some constraints related to the biological model, we are planning to pool a small number of cells together (2 to 10) and perform RNA-seq on this pool. The underlying experimental question is a classical transcriptome comparison of groups that would be impossible to do with unsorted bulk RNA-seq.

My questions are the following:

Do you see any issue with such an experiment? My concerns are mainly related to the downstream analysis. Are the classical statistical programs such as DESeq2 (or the ones developed for single-cell data) applicable to such an experiment, which is in between a true single-cell and a bulk RNA-seq experiment?

Moreover, would you have any advice before we perform the wet-lab experiment? Anything that might be important for the downstream analysis and that a wet-lab scientist might miss would be welcome.

Thanks in advance for your answers!

deseq2 single cell limma voom • 3.1k views

updated 8.0 years ago by Aaron Lun ★ 28k • written 8.0 years ago by Radek ▴ 90

score 3 · Answer 1 · 2016-04-12

I'm not sure calling it "single-cell" is appropriate, given that you're using pools. I don't have any better naming suggestions, though, so I'll just pretend that you're sequencing single cells. In any case, pooling across 2-10 cells will not get you close to bulk data - at least, not in my experience. I've had to pool all cells on a plate (around 80) by summing the counts for all cells. When I do so and try to analyze it with edgeR, I get something equivalent to a very noisy bulk data set, based on the NB dispersion estimates.

Anyway, there are plenty of dedicated single-cell analysis methods on Bioconductor, depending on what you want to do. scater and cellity handle quality control, monocle builds pseudo-temporal orderings for biological processes, scde does differential expression and variable gene set testing, etc. I'll give a plug for my own package, scran, that handles low-level analyses of scRNA-seq data. So, you're not limited to using bulk-based methods if you have single-cell RNA-seq data lying around that needs to be analysed.

That said, if you want to do DE analyses, I find that edgeR works pretty well on single-cell counts. There are a couple of issues that you need to work around, though. Firstly, the mean-dispersion trend doesn't fit well because the dispersions are so large and variable across genes. Fortunately, if you have enough cells, you can estimate the dispersion reliably for each gene without needing to do empirical Bayes (EB) shrinkage towards the trend. The LRT can then be used to test for DE. Secondly, TMM falls apart when you have high dropout rates and many zeroes, so you need to normalize with something a bit more robust (here's a plug for the computeSumFactors function in scran).

Mike mentioned something about zero inflation in his answer. While it's true that there are a lot of dropouts in scRNA-seq data, and that this could be better modelled with zero-inflated models, I find that the standard NB model in edgeR (and presumably also in DESeq2) actually does an okay job. This is because the NB dispersion is so high anyway, due to technical noise, amplification biases, etc., that you end up with a substantial probability mass at zero, even without any explicit zero inflation. That's not to say that ZINB won't do better; I'm just saying that the vanilla NB approach isn't disastrously wrong. (There is, of course, the pathological case where a low-variance, high-abundance gene has lots of zero counts due to a subpopulation of cells in which the gene is silent. In such cases, a ZINB model would clearly be better; however, I would question the wisdom of treating these cells as replicates at all.)
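To put a rough number on the point about probability mass at zero: for an NB distribution with mean mu and dispersion phi (so that the variance is mu + phi * mu^2), the probability of a zero count is (1 + phi * mu)^(-1/phi). The Python sketch below just evaluates that formula. The mean and dispersion values are illustrative assumptions, not estimates from any real data set, and `nb_zero_prob` is a hypothetical helper name.

```python
import math

def nb_zero_prob(mu, phi):
    """P(count == 0) for a negative binomial with mean mu and dispersion phi.

    Parameterised so that variance = mu + phi * mu**2.
    For phi <= 0 the Poisson value exp(-mu) is returned instead.
    """
    if phi <= 0:
        return math.exp(-mu)
    return (1.0 + phi * mu) ** (-1.0 / phi)

# Illustrative values only (assumed, not taken from real data).
mu = 10.0
for phi in (0.1, 1.0, 5.0):
    print(f"mean={mu}, dispersion={phi}: P(0) = {nb_zero_prob(mu, phi):.4f}")
```

With a mean of 10, the zero probability goes from about 0.001 at a bulk-like dispersion of 0.1 to roughly 0.46 at a dispersion of 5, without any explicit zero-inflation term.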

Finally, if you're planning the experiment, I would suggest a couple of things:

Do enough cells. Single-cell data is noisy and unreliable per cell, so the solution is to just do it for more cells and share information across cells to improve the reliability of the analysis. In your case, these are pools of cells, so I'll talk in terms of wells (for plate-based protocols such as SMART-seq2; or reaction chambers, for the C1; or tubes, for other protocols). I'd suggest at least a full plate with 96 wells, and preferably multiple plates with cells taken from replicate animals.
Add spike-ins to each well. Yes, ERCCs are much maligned, but they do a couple of things - one, they tell you if sequencing worked at all for each well, and two, they give you a measure of the technical noise. The latter is important if you want to decompose the variance to get the biological variability, e.g., to identify HVGs that might be segregating subpopulations.
UMI data is a lot less variable as it avoids amplification biases. I don't know to what extent this improves the quality of the downstream analysis, but that might be something to think about.
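As a rough illustration of the variance decomposition mentioned for spike-ins, here is a minimal Python sketch. It assumes a crude additive model in which the squared coefficient of variation (CV²) of a gene is the sum of a technical and a biological part, and it assumes the technical part follows CV² = a0 + a1 / mean, fitted to the spike-ins by ordinary least squares. This is not the method of any particular package; the function names and the toy numbers are hypothetical, and normalization is ignored entirely.

```python
from statistics import mean, variance

def cv2(counts):
    """Squared coefficient of variation of one gene across wells."""
    m = mean(counts)
    return variance(counts) / m ** 2

def fit_technical_trend(spike_means, spike_cv2s):
    """Least-squares fit of cv2 = a0 + a1 / mean to the spike-in transcripts."""
    x = [1.0 / m for m in spike_means]
    xbar, ybar = mean(x), mean(spike_cv2s)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, spike_cv2s))
    a1 = sxy / sxx
    a0 = ybar - a1 * xbar
    return a0, a1

def biological_cv2(gene_counts, a0, a1):
    """Total CV2 minus the fitted technical CV2 at the gene's mean, floored at zero."""
    technical = a0 + a1 / mean(gene_counts)
    return max(cv2(gene_counts) - technical, 0.0)

# Toy spike-in summary (assumed values): mean count and CV2 per spike-in transcript.
spike_means = [10.0, 100.0, 1000.0]
spike_cv2s = [0.3, 0.12, 0.102]
a0, a1 = fit_technical_trend(spike_means, spike_cv2s)

# Toy endogenous gene measured in four wells (assumed values).
gene = [0, 50, 100, 250]
print("technical trend:", a0, a1)
print("biological CV2 estimate:", biological_cv2(gene, a0, a1))
```

Genes whose estimate stays well above zero would be candidate HVGs under these assumptions. Without spike-ins in each well there is nothing to fit the technical trend to.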

score 0 · Answer 2 · 2016-04-12

I haven't worked with single-cell data, so I can't offer solid advice. I suppose you could download publicly available single-cell data and add together columns to see at what point the distribution of aggregated counts within each condition begins to look like the within-condition distribution for bulk RNA-seq. There are lots of caveats with this idea (library size differences, etc.), but it would give you a very rough idea of what the data from your proposed approach might look like. My guess is that it would take many more than 10 cells, but this is really a wild guess given that I haven't worked on this kind of data yet. DESeq2 does not work well with highly zero-inflated data, which cannot be captured by the negative binomial model. You'd be better off using a method that can accommodate zero inflation, and there are dozens of scRNA-seq methods now.
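The column-adding idea can be tried on simulated counts before touching real data. The Python sketch below is only a simulation under stated assumptions: counts for one gene are drawn from a gamma-Poisson (NB) model with a made-up mean and dispersion, cells are summed into pools of a given size, and a method-of-moments dispersion is computed for each pool size. All function names and parameter values are hypothetical. It also assumes the cells in a pool are independent and identically distributed, which real cells are not, so real pools will probably look noisier than this.

```python
import math
import random
from statistics import mean, variance

def rpois(lam, rng):
    """Draw a Poisson variate with Knuth's multiplication method (fine for small lam)."""
    limit = math.exp(-lam)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k

def rnbinom(mu, phi, rng):
    """Draw an NB count with mean mu and dispersion phi via a gamma-Poisson mixture."""
    # gammavariate(shape, scale): mean = shape * scale = mu, variance = phi * mu**2
    rate = rng.gammavariate(1.0 / phi, mu * phi)
    return rpois(rate, rng)

def pool_counts(cell_counts, pool_size):
    """Sum consecutive groups of pool_size cells; leftover cells are dropped."""
    n_pools = len(cell_counts) // pool_size
    return [sum(cell_counts[i * pool_size:(i + 1) * pool_size]) for i in range(n_pools)]

def moment_dispersion(counts):
    """Method-of-moments NB dispersion, (var - mean) / mean**2, floored at zero."""
    m = mean(counts)
    return max((variance(counts) - m) / m ** 2, 0.0)

rng = random.Random(1)  # fixed seed so the run is repeatable
# Assumed single-cell-like parameters for one gene: mean 5, dispersion 2.
cells = [rnbinom(mu=5.0, phi=2.0, rng=rng) for _ in range(20000)]

for size in (1, 2, 10, 80):
    pooled = pool_counts(cells, size)
    print(f"pool size {size:>2}: dispersion ~ {moment_dispersion(pooled):.3f}")
```

Under the i.i.d. assumption, the sum of k NB counts with dispersion phi is again NB with dispersion phi / k, so the estimates should fall roughly in proportion to pool size. With real data, replacing `cells` with a column of counts per gene would show how far actual pools depart from that ideal.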