19 months ago by
Cambridge, United Kingdom
I'm not sure calling it "single-cell" is appropriate, given that you're using pools. I don't have any better naming suggestions, though, so I'll just pretend that you're sequencing single cells. In any case, pooling across 2-10 cells will not get you close to bulk data - at least, not in my experience. I've had to pool all cells on a plate (around 80) by summing the counts for all cells. When I do so and try to analyze it with edgeR, I get something equivalent to a very noisy bulk data set, based on the NB dispersion estimates.
Anyway, there are plenty of dedicated single-cell analysis methods on Bioconductor, depending on what you want to do. scater and cellity handle quality control, monocle builds pseudo-temporal orderings for biological processes, scde does differential expression and variable gene set testing, etc. I'll give a plug for my own package, scran, that handles low-level analyses of scRNA-seq data. So, you're not limited to using bulk-based methods if you have single-cell RNA-seq data lying around that needs to be analysed.
That said, if you want to do DE analyses, I find that edgeR works pretty well on single-cell counts. There are a couple of issues that you need to work around, though. Firstly, the mean-dispersion trend doesn't fit well because the dispersions are so large and variable across genes. Fortunately, if you have enough cells, you can estimate the dispersion reliably for each gene without needing to do EB shrinkage towards the trend. The LRT can then be used to test for DE. Secondly, TMM falls apart when you have high dropout rates and many zeroes, so you need to normalize with something a bit more robust (here's a plug for the computeSumFactors function in scran).
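To make the normalization point concrete, here is a toy sketch in plain Python of why zeroes hurt ratio-based methods and why summing counts across cells helps. The count matrix is made up for illustration, and this is not the algorithm inside computeSumFactors; it only shows that two sparse cells may share no gene with a finite log-ratio, while a pooled profile has far fewer zeroes.

```python
# Toy counts (hypothetical): rows are cells, columns are genes.
cells = [
    [0, 3, 0, 0, 1, 0],
    [2, 0, 0, 1, 0, 0],
    [0, 0, 4, 0, 0, 1],
    [1, 0, 0, 0, 2, 0],
]

def zero_fraction(counts):
    """Proportion of genes with a zero count in one profile."""
    return sum(1 for c in counts if c == 0) / len(counts)

def usable_genes(a, b):
    """Indices of genes with non-zero counts in both profiles.

    Only these genes give a finite log-ratio between the two profiles,
    which is what a ratio-based normalization method has to work with.
    """
    return [i for i, (x, y) in enumerate(zip(a, b)) if x > 0 and y > 0]

# Pool the cells by summing counts gene by gene.
pooled = [sum(gene) for gene in zip(*cells)]

per_cell_zero = [zero_fraction(c) for c in cells]
shared_0_1 = usable_genes(cells[0], cells[1])
shared_0_pool = usable_genes(cells[0], pooled)

print("zero fraction per cell:", per_cell_zero)
print("zero fraction in pool: ", zero_fraction(pooled))
print("genes usable, cell 0 vs cell 1:", shared_0_1)
print("genes usable, cell 0 vs pool:  ", shared_0_pool)
```

In this toy case, cells 0 and 1 have no gene in common with non-zero counts, so a cell-to-cell ratio is undefined everywhere, whereas every non-zero gene in cell 0 can be compared against the pooled profile.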
Mike mentioned something about zero inflation in his answer. While it's true that there are a lot of dropouts in scRNA-seq data, and that this could be better modelled with zero-inflated models, I find that the standard NB model in edgeR (and presumably also in DESeq2) actually does an okay job. This is because the NB dispersion is so high anyway, due to technical noise, amplification biases, etc., that you end up with a substantial probability mass at zero, even without any explicit zero inflation. That's not to say that ZINB won't do better; I'm just saying that the vanilla NB approach isn't disastrously wrong. (There is, of course, the pathological case where a low-variance, high-abundance gene has lots of zero counts due to a subpopulation of cells in which the gene is silent. In such cases, a ZINB model would clearly be better; however, I would question the wisdom of treating these cells as replicates at all.)
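The argument about probability mass at zero can be checked with the NB formula directly. For an NB distribution with mean mu and dispersion phi (variance mu + phi * mu^2), the probability of a zero count is (1 + phi * mu)^(-1/phi). The sketch below compares a bulk-like dispersion with a large one at the same mean; the specific values of mu and phi are illustrative assumptions, not estimates from any data set.

```python
import math

def nb_zero_prob(mu, phi):
    """P(count == 0) for a negative binomial with mean mu and dispersion phi.

    Uses the parameterization where variance = mu + phi * mu**2.
    """
    return (1.0 + phi * mu) ** (-1.0 / phi)

def poisson_zero_prob(mu):
    """P(count == 0) for a Poisson with mean mu (the phi -> 0 limit)."""
    return math.exp(-mu)

mu = 10.0          # illustrative mean count
phi_bulk = 0.1     # assumed bulk-like dispersion
phi_sc = 5.0       # assumed large single-cell-like dispersion

p_pois = poisson_zero_prob(mu)
p_bulk = nb_zero_prob(mu, phi_bulk)
p_sc = nb_zero_prob(mu, phi_sc)

print(f"Poisson:          {p_pois:.2e}")
print(f"NB, phi = {phi_bulk}:    {p_bulk:.2e}")
print(f"NB, phi = {phi_sc}:    {p_sc:.2f}")
```

With the same mean of 10, the zero probability goes from under one in a thousand at the bulk-like dispersion to over 40% at the large dispersion, without any explicit zero-inflation component.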
Finally, if you're planning the experiment, I would suggest a couple of things:
- Do enough cells. Single-cell data is noisy and unreliable per cell, so the solution is to just do it for more cells and share information across cells to improve the reliability of the analysis. In your case, these are pools of cells, so I'll talk in terms of wells (for plate-based protocols like SMART-seq2; or reaction chambers, for the C1; or tubes, for other protocols). I'd suggest at least a full plate with 96 wells, and preferably multiple plates with cells taken from replicate animals.
- Add spike-ins to each well. Yes, ERCCs are much maligned, but they do a couple of things - one, they tell you if sequencing worked at all for each well, and two, they give you a measure of the technical noise. The latter is important if you want to decompose the variance to get the biological variability, e.g., to identify HVGs that might be segregating subpopulations.
- UMI data is a lot less variable as it avoids amplification biases. I don't know to what extent this improves the quality of the downstream analysis, but that might be something to think about.
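As a rough illustration of the variance decomposition mentioned for spike-ins, the sketch below fits a technical noise curve to spike-in data and subtracts it from each gene's total variability. The parametric form (squared coefficient of variation modelled as a0 + a1 / mean), the least-squares fit, and all numbers are assumptions chosen for simplicity; this is one possible approach and not necessarily what any particular package does.

```python
def fit_technical_cv2(means, cv2s):
    """Ordinary least-squares fit of cv2 = a0 + a1 / mean to spike-in data."""
    xs = [1.0 / m for m in means]
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(cv2s) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, cv2s))
    a1 = sxy / sxx
    a0 = ybar - a1 * xbar
    return a0, a1

def biological_cv2(gene_mean, gene_cv2, a0, a1):
    """Total CV^2 minus the fitted technical CV^2 at this mean, floored at 0."""
    technical = a0 + a1 / gene_mean
    return max(gene_cv2 - technical, 0.0)

# Hypothetical spike-in summaries (mean count and CV^2 across wells).
spike_means = [5.0, 10.0, 50.0, 100.0, 500.0]
spike_cv2s = [0.5, 0.3, 0.14, 0.12, 0.104]

a0, a1 = fit_technical_cv2(spike_means, spike_cv2s)

# Two hypothetical endogenous genes at the same mean abundance.
variable_gene = biological_cv2(20.0, 0.9, a0, a1)   # well above technical noise
quiet_gene = biological_cv2(20.0, 0.15, a0, a1)     # below the technical curve

print("fitted a0, a1:", a0, a1)
print("biological CV^2, variable gene:", variable_gene)
print("biological CV^2, quiet gene:   ", quiet_gene)
```

Genes with a large positive biological component would be candidate HVGs; genes at or below the technical curve are indistinguishable from noise under this model.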
modified 19 months ago
19 months ago by
Aaron Lun • 17k