Question: Pool Single-cells: Analysis
Asked 2.7 years ago by Radek (40) • Belgium

Hello!

We are thinking of doing RNA-seq on single cells. Due to some constraints related to the biological model, we are planning to pool a small number of cells together (2 to 10) and perform RNA-seq on this pool. The underlying experimental question is a classical transcriptome comparison between groups that would be impossible to do with unsorted bulk RNA-seq.

My questions are the following:

Would you see any issue with such an experiment? My concerns are mainly related to the downstream analysis. Are the classical statistical packages such as DESeq2 (or those developed for single-cell data) applicable to such an experiment, which is in between a true single-cell and a bulk RNA-seq experiment?

Moreover, would you have any advice before we perform the wet-lab experiment? Anything that might be important for the downstream analysis and that a wet-lab scientist might miss.

modified 2.7 years ago by Aaron Lun (21k) • written 2.7 years ago by Radek (40)
Answer: 3 votes • 2.7 years ago
Aaron Lun (21k), Cambridge, United Kingdom, wrote:

I'm not sure calling it "single-cell" is appropriate, given that you're using pools. I don't have any better naming suggestions, though, so I'll just pretend that you're sequencing single cells. In any case, pooling across 2-10 cells will not get you close to bulk data - at least, not in my experience. I've had to pool all cells on a plate (around 80) by summing the counts for all cells. When I do so and try to analyze it with edgeR, I get something equivalent to a very noisy bulk data set, based on the NB dispersion estimates.
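
For intuition on why small pools don't behave like bulk, here is a toy simulation (plain Python/NumPy, not any of the packages discussed here; all parameter values are made up for illustration). Summing n i.i.d. NB cells with dispersion phi gives a pooled dispersion of roughly phi/n, so a pool of 5 cells with a typical single-cell dispersion of 2 is still far noisier than a bulk data set:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_cells(n_cells, mean=10.0, dispersion=2.0, n_genes=5000):
    """NB counts (genes x cells), one mean/dispersion for every gene."""
    r = 1.0 / dispersion              # NB "size" parameter
    p = r / (r + mean)
    return rng.negative_binomial(r, p, size=(n_genes, n_cells))

def moment_dispersion(counts):
    """Method-of-moments dispersion: var = mu + phi * mu^2."""
    mu = counts.mean(axis=1)
    var = counts.var(axis=1, ddof=1)
    return np.median((var - mu) / mu**2)

# Sum pools of 1, 5, or 80 cells into 50 "samples" and re-estimate the
# dispersion; summing n cells shrinks it roughly n-fold (phi / n).
disp = {}
for pool in (1, 5, 80):
    pooled = simulate_cells(pool * 50).reshape(5000, 50, pool).sum(axis=2)
    disp[pool] = moment_dispersion(pooled)
    print(pool, round(disp[pool], 3))
```

With these made-up parameters, a per-cell dispersion of ~2 only drops to bulk-like levels (~0.02) after pooling on the order of a hundred cells, consistent with the plate-level pooling described above.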

Anyway, there are plenty of dedicated single-cell analysis methods on Bioconductor, depending on what you want to do. scater and cellity handle quality control, monocle builds pseudo-temporal orderings for biological processes, scde does differential expression and variable gene set testing, etc. I'll give a plug for my own package, scran, that handles low-level analyses of scRNA-seq data. So, you're not limited to using bulk-based methods if you have single-cell RNA-seq data lying around that needs to be analysed.

That said, if you want to do DE analyses, I find that edgeR works pretty well on single-cell counts. There's a couple of issues that you need to work around, though. Firstly, the mean-dispersion trend doesn't fit well because the dispersions are so large and variable across genes. Fortunately, if you have enough cells, you can estimate the dispersion reliably without needing to do EB shrinkage towards the trend. The LRT can then be used to test for DE. Secondly, TMM falls apart when you have high dropout rates and many zeroes, so you need to normalize with something a bit more robust (here's a plug for the computeSumFactors function in scran).
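
To illustrate why ratio-based normalization breaks down with many zeroes, and why pooling cells before computing ratios (the idea behind computeSumFactors) helps, here is a hypothetical toy simulation. This is not scran's actual algorithm — the real method also deconvolves pool-level factors back to per-cell factors — and every number here is invented:

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes, n_cells, pool_size = 2000, 96, 8

true_sf = rng.uniform(0.5, 2.0, n_cells)      # per-cell sequencing depth
base = rng.gamma(2.0, 5.0, n_genes)           # per-gene expression level
keep = rng.random((n_genes, n_cells)) < 0.3   # ~70% dropout
counts = rng.poisson(np.outer(base, true_sf) * keep)

ref = counts.mean(axis=1)                     # pseudo-bulk reference profile
ok = ref > 0

# Median-of-ratios per cell: with >50% zeroes in every cell, the median
# ratio lands on a zero, so the estimated factors are useless.
percell = np.median(counts[ok] / ref[ok, None], axis=0)
print("cells with a zero size factor:", (percell == 0).sum(), "/", n_cells)

# Pool cells first: sums over 8 cells have almost no zeroes, so the
# median ratio tracks the (summed) true factors well.
pools = counts.reshape(n_genes, n_cells // pool_size, pool_size).sum(axis=2)
pool_est = np.median(pools[ok] / ref[ok, None], axis=0)
pool_true = true_sf.reshape(-1, pool_size).sum(axis=1)
print("correlation with true pooled factors:",
      round(np.corrcoef(pool_est, pool_true)[0, 1], 3))
```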

Mike mentioned something about zero inflation in his answer. While it's true that there's a lot of dropouts in scRNA-seq data, and that this could be better modelled with zero-inflated models, I find that the standard NB model in edgeR (and presumably also in DESeq2) actually does an okay job. This is because the NB dispersion is so high anyway, due to technical noise, amplification biases, etc. that you end up with a substantial probability mass at zero, even without any explicit zero inflation. That's not to say that ZINB won't do better; I'm just saying that the vanilla NB approach isn't disastrously wrong. (There is, of course, the pathological case where a low-variance, high-abundance gene has lots of zero counts due to a subpopulation of cells in which the gene is silent. In such cases, a ZINB model would clearly be better; however, I would question the wisdom of treating these cells as replicates at all.)
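
The "substantial probability mass at zero" point follows directly from the NB zero probability, P(X = 0) = (1 + phi*mu)^(-1/phi). A quick illustration with made-up values:

```python
def nb_zero_prob(mu, phi):
    """P(X = 0) for a negative binomial with mean mu and dispersion phi."""
    return (1.0 + phi * mu) ** (-1.0 / phi)

# Bulk-like dispersion: a mean of 5 gives ~1% zeroes.
print(nb_zero_prob(mu=5, phi=0.05))
# Single-cell-like dispersion: the same mean gives ~30% zeroes,
# with no explicit zero-inflation component at all.
print(nb_zero_prob(mu=5, phi=2.0))
```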

Finally, if you're planning the experiment, I would suggest a couple of things:

• Do enough cells. Single-cell data is noisy and unreliable per cell, so the solution is to just do it for more cells and share information across cells to improve the reliability of the analysis. In your case, these are pools of cells, so I'll talk in terms of wells (for plate-based protocols for SMART-seq2; or reaction chambers, for the C1; or tubes, for other protocols). I'd suggest at least a full plate with 96 wells, and preferably multiple plates with cells taken from replicate animals.
• Add spike-ins to each well. Yes, ERCCs are much maligned, but they do a couple of things - one, they tell you if sequencing worked at all for each well, and two, they give you a measure of the technical noise. The latter is important if you want to decompose the variance to get the biological variability, e.g., to identify HVGs that might be segregating subpopulations.
• UMI data is a lot less variable as it avoids amplification biases. I don't know to what extent this improves the quality of the downstream analysis, but that might be something to think about.
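
As a sketch of how the spike-ins feed into the variance decomposition: the spike-in is added in the same quantity to every well, so its squared coefficient of variation estimates the technical component, which can then be subtracted from an endogenous gene's total CV². The toy simulation below uses an assumed noise model (per-well capture efficiency times Poisson counting noise) and made-up parameters; real methods, e.g. in scran, fit a mean-dependent technical trend across many spike-ins rather than using a single point:

```python
import numpy as np

rng = np.random.default_rng(2)
n_cells = 96

# Assumed technical noise: per-well capture efficiency, then Poisson.
capture = rng.lognormal(0.0, 0.3, n_cells)

def observe(true_expr):
    """Counts for one gene across wells under the assumed noise model."""
    return rng.poisson(true_expr * capture)

def cv2(x):
    return np.var(x) / np.mean(x) ** 2

# Spike-in: the same amount in every well, so its CV^2 is purely technical.
spike = observe(np.full(n_cells, 50.0))
tech_cv2 = cv2(spike)

# Endogenous gene: genuine cell-to-cell variability (CV^2 = 0.25) on top.
gene = observe(rng.gamma(4.0, 12.5, n_cells))
total_cv2 = cv2(gene)

bio_cv2 = total_cv2 - tech_cv2   # biological component, by subtraction
print(round(tech_cv2, 3), round(total_cv2, 3), round(bio_cv2, 3))
```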

Interesting, thanks Aaron.

Thanks, this is a great summary! I'm not set on using DESeq2/edgeR; they are just the tools I currently know best. I'll follow your advice and take a deeper look at the packages dedicated to single cells.

Based on what you discuss, the number of cells pooled will be a key factor. We had two reasons for pooling after sorting and before doing the RNA-seq: 1) cost, and 2) the hope that it might increase sensitivity for low counts.

But after reading your answer, I'm not sure pooling would be that useful. Basically, which approach would you recommend in order to maximise our chances of success:

1) 3 animals per condition, from which we take 5-10 positive cells and pool them (total = 9 samples)

OR

2) 3 animals per condition, from which we take 5-10 positive cells and sequence them separately (total = 30 samples)

And finally, will such a low number of cells per condition (15 to 30) be enough for a robust estimation of the counts? The sorting will be quite complicated, and I would like to make the best choice, taking reproducibility and cost into account.

5-10 cells seems too low to me; you'll be dominated by technical noise in any analysis and you won't have enough power to really do anything. Unless these cells are incredibly rare or unique (e.g., neurons), I would aim for at least 30-50 cells from each animal, probably a full plate (i.e., 96 wells) per animal for convenience. If the only reason you're doing such small numbers is cost - well, don't be stingy, because a good data set will inevitably cost money.
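
A rough simulation of the power argument (made-up parameters; a normal-approximation Welch test on log-counts stands in for a real DE method like edgeR): with a two-fold change and single-cell-level dispersion, 5 cells per group detects almost nothing, while 50 per group does considerably better.

```python
import numpy as np

rng = np.random.default_rng(3)

def nb(mean, phi, size):
    """NB counts with the given mean and dispersion."""
    r = 1.0 / phi
    return rng.negative_binomial(r, r / (r + mean), size=size)

def welch_significant(a, b, z=1.96):
    """Approximate Welch test on log1p counts, normal critical value."""
    la, lb = np.log1p(a), np.log1p(b)
    se = np.sqrt(la.var(ddof=1) / la.size + lb.var(ddof=1) / lb.size)
    return abs(la.mean() - lb.mean()) > z * se

def power(n_per_group, n_sim=2000, fold=2.0, mean=10.0, phi=2.0):
    """Fraction of simulations in which a 2-fold change is detected."""
    hits = sum(welch_significant(nb(mean, phi, n_per_group),
                                 nb(mean * fold, phi, n_per_group))
               for _ in range(n_sim))
    return hits / n_sim

p5, p50 = power(5), power(50)
print("power with  5 cells/group:", p5)
print("power with 50 cells/group:", p50)
```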

Unfortunately, cost is not the main reason. Those cells are really tricky to get. I'll see if it is possible to screen 10 plates per animal to increase the number of harvested cells.

Answer: 0 votes • 2.7 years ago
Michael Love (20k), United States, wrote:

I haven't worked with single-cell data, so I can't offer solid advice. I suppose you could download publicly available single-cell data and add together columns to see at what point the distributions of aggregated counts within each condition begin to look like the within-condition distributions for bulk RNA-seq. There are lots of caveats with this idea (library size differences, etc.), but it would give you a very rough idea of what the data from your proposed approach might look like. My guess is that it would take many more than 10 cells, but this is really a wild guess given that I haven't worked on this data yet. DESeq2 does not work well with highly zero-inflated data, which cannot be captured by the negative binomial model. You'd be better off using a method that can accommodate zero inflation; there are dozens of scRNA-seq methods now.