Downsample or not to downsample?
p.joshi ▴ 30


I am trying to determine whether down-sampling reads will affect my multi-dataset batch integration. I am using the scran, scater and batchelor packages in my pipeline.

1) First, I processed each dataset to get normalized values using cluster-based size factors.
2) Then I used multiBatchNorm to rescale the normalized values across datasets.
3) Finally, I performed fastMNN-based batch correction and used the corrected dimensions to plot the UMAP distribution.

From a quick overview, I am getting the expected integration of cells from biological replicate batches. I haven't performed annotation yet, so I can't say for certain.

To perform marker analysis, I was going to use both findMarkers and edgeR-based approaches. As edgeR requires raw counts, I am worried that counts from deeply sequenced time points will affect the analysis. I also wanted to check how well down-sampling would reproduce the results of the data integration analysis.

I thought downsampleBatches would be an appropriate strategy for what I was planning to do. To do that, I first extracted the counts of each dataset and got a downsampled count matrix (hoping this is in proportion to the lowest-depth sample). Next, I tried calculating new size factors for these new count matrices, again following cluster-based size factor estimation. Each dataset gave a warning about negative size factor estimates in computeSumFactors, for which I found an explanation on the function's help page. However, for a few datasets, even quickCluster gives a negative size factor error and then fails to run.
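For readers unfamiliar with the idea, here is a minimal sketch of what "downsampling to the lowest depth sample" means conceptually. It is not the batchelor implementation: the function names `downsample_counts` and `downsample_to_lowest_depth` are hypothetical, depth is assumed to be the mean total count per cell in each batch, and each count is thinned binomially with a batch-specific proportion.

```python
import random

def downsample_counts(counts, prop, rng):
    """Binomially thin a count vector: keep each read/UMI with probability `prop`."""
    if not 0.0 <= prop <= 1.0:
        raise ValueError("prop must be in [0, 1]")
    return [sum(1 for _ in range(c) if rng.random() < prop) for c in counts]

def downsample_to_lowest_depth(batches, rng):
    """batches: dict of batch name -> non-empty list of per-cell count vectors.

    Assumption for this sketch: a batch's depth is its mean total count per
    cell, and every batch is thinned to match the shallowest batch.
    """
    depth = {
        name: sum(sum(cell) for cell in cells) / len(cells)
        for name, cells in batches.items()
    }
    target = min(depth.values())
    out = {}
    for name, cells in batches.items():
        prop = target / depth[name]  # equals 1.0 for the shallowest batch
        out[name] = [downsample_counts(cell, prop, rng) for cell in cells]
    return out

# Toy example: two batches with a 10-fold difference in depth.
rng = random.Random(0)
toy = {
    "shallow": [[5, 0, 3], [2, 1, 1]],
    "deep": [[50, 0, 30], [20, 10, 10]],
}
thinned = downsample_to_lowest_depth(toy, rng)
print(thinned)
```

The shallowest batch is left untouched, while the deeper batch loses most of its counts. That loss is why size factor estimation can start to struggle on the downsampled matrices.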

So my question is, is it necessary to check effect of down-sampling? The depth of my individual datasets ranges from ~30k reads per cell to 350k reads per cell. Most of them are around 60-70k, but the outliers on the high end are 170k, 240k and 350k reads per sample. If I should, what would be a better strategy? I also checked downsampleReads, but it requires an HDF5 file, which my STARsolo run doesn't produce.

Could I get the cluster annotation from full dataset analysis and use the downsampled data for edgeR based differential gene expression analysis? That way I wouldn't have to normalize the downsampled data; however, I don't know whether the analysis would stay similar. Hence, I wanted some opinions on my strategy, and on alternative approaches that could help me with downsampling, if I should do that for my dataset.

Thanks, Piyush

scrna-seq downsampleBatches scran • 2.1k views
Aaron Lun ★ 28k

> So my question is, is it necessary to check effect of down-sampling?

I wouldn't bother. Downsampling has some uses, specifically in the definitive removal of library size-associated trends that are driven by differences in variance (and thus cannot be removed by simple scaling normalization). This might be helpful when you're trying to double-check relatively subtle effects, e.g., trajectories and such.

However, it's difficult to be comfortable with the amount of information thrown away in the process if we were to do this routinely. For example, the negative size factor warnings from computeSumFactors are probably due to the fact that the downsampled counts are too low. (Note that the definition of "too low" depends on the variance; a 30k total UMI count would probably be fine.) Moreover, if your downsampling results disagree with your original results, you'll never know if that was because the latter was an artifact of differences in depth between batches or if the downsampling threw away resolution of genuine biology.

Now, if you see the same results before and after downsampling, you might feel a little bit more confident in the results. But only a little - differences in sequencing depth between batches are probably the least of your concerns. (To list a few - assumptions from MNN-based batch correction; clustering resolution and ambiguity; suboptimal feature/PC selection.) You could spend months fiddling with all (combinations of) those parameters and checking whether they have an effect on the results, but at some point, it is literally cheaper to just check the conclusions with an appropriately designed follow-up experiment.

> Could I get the cluster annotation from full dataset analysis and use the downsampled data for edgeR based differential gene expression analysis?

Sure, you could do that, but edgeR is even more robust to differences in library size than your average scRNA-seq analysis, given that the former actually uses a model to handle the mean-variance relationship of count data. So if you're willing to trust the results from the full dataset analysis, there's no reason to distrust the edgeR analysis.


Thanks Aaron. I feel confident of the overall data integration with fastMNN as I see similar cell types across stages getting clustered together based on some marker expression. I just wanted an expert opinion on downsampling.

I also have another question regarding differential gene expression. In our meetings, other bioinformaticians have suggested comparing similar numbers of cells across clusters for DGE. So if the smallest cluster has 100 cells, the other clusters should be reduced to this size, to explore the stability of the DGE markers. One way would be to bootstrap cells from the bigger clusters and combine the results (taking the average of the LFC and p-value?). I didn't find their opinion incorrect, as in bulk RNA-seq DGE I have found it better to have a balanced comparison. Do you think this is a good approach for marker analysis in single-cell data? I haven't seen any tutorials on it, so I was wondering whether differences in cell number are a big issue for marker analysis.


We explored this idea to some extent in the Biostatistics paper. I'll assume you are following a pseudo-bulk strategy, given that edgeR wouldn't be able to handle sample-level variability for per-cell counts.
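For concreteness, a pseudo-bulk profile is just the sum of the raw counts over all cells that share a sample and cluster label. The sketch below shows that aggregation in plain Python; `pseudo_bulk` is a hypothetical helper written for illustration, not a function from scran, scater or edgeR.

```python
from collections import defaultdict

def pseudo_bulk(counts, samples, clusters):
    """Sum per-cell counts into one profile per (sample, cluster) pair.

    counts: list of per-cell gene-count lists (all of equal length).
    samples, clusters: per-cell labels, same length as `counts`.
    Returns (sums, n_cells), both keyed by (sample, cluster).
    """
    if not (len(counts) == len(samples) == len(clusters)):
        raise ValueError("counts, samples and clusters must have equal length")
    sums = {}
    n_cells = defaultdict(int)
    for cell, sample, cluster in zip(counts, samples, clusters):
        key = (sample, cluster)
        if key not in sums:
            sums[key] = [0] * len(cell)
        sums[key] = [a + b for a, b in zip(sums[key], cell)]
        n_cells[key] += 1
    return sums, dict(n_cells)

# Toy example: four cells, two genes, two samples, two clusters.
sums, n_cells = pseudo_bulk(
    counts=[[1, 0], [2, 3], [0, 1], [4, 4]],
    samples=["rep1", "rep1", "rep2", "rep2"],
    clusters=["A", "A", "A", "B"],
)
print(sums)
print(n_cells)
```

Each (sample, cluster) column would then be treated as one observation in the DGE model, with the number of contributing cells varying between columns.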

tl;dr Don't worry about differences in cell number.

The main effect from such differences is that the sum from counts with more cells is more precise. In theory, this is not ideal because it means that different observations for the same gene would have different dispersions, whereas edgeR assumes that all observations have the same dispersion. In practice, this doesn't matter (much) for a variety of reasons:

  • The precision of the sum converges to Poisson with many cells, at which point any differences in the true dispersion across observations become small enough that they have no effect. "Many" being as low as 20, if I remember from the manuscript.
  • The precision of the sum becomes a minor contributor to the between-replicate variability of the sums, which is instead driven by biological differences between your samples (not cells!). Increasing the number of cells in these pseudo-bulk samples is analogous to increasing your sequencing depth in actual bulk samples; past a certain point you have removed all of the technical noise but the biological variance remains.
  • edgeR seems pretty robust to violations of its "same dispersion" assumption.
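The second point can be illustrated with a small simulation. This is a toy sketch under stated assumptions, not the analysis from the paper: each sample gets its own true mean drawn from a gamma distribution (biological variation with squared CV of 0.04), each cell contributes a Poisson count around that mean, and `rpois` and `pseudo_bulk_cv2` are hypothetical helpers. Under these assumptions, the squared CV of the pseudo-bulk sum is roughly 1/(n × mean) + 0.04, so the technical term vanishes as the number of cells n grows while the biological term stays.

```python
import math
import random
import statistics

def rpois(lam, rng):
    """Knuth's Poisson sampler; adequate for the small means used here."""
    limit = math.exp(-lam)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k

def pseudo_bulk_cv2(n_cells, n_samples, rng, bio_shape=25.0):
    """Squared coefficient of variation of pseudo-bulk sums across samples.

    Assumptions: per-sample mean ~ Gamma(shape=bio_shape, scale=1/bio_shape),
    i.e. mean 1 and biological CV^2 = 1/bio_shape; per-cell counts are Poisson.
    """
    sums = []
    for _ in range(n_samples):
        mu = rng.gammavariate(bio_shape, 1.0 / bio_shape)
        sums.append(sum(rpois(mu, rng) for _ in range(n_cells)))
    mean = statistics.mean(sums)
    return statistics.variance(sums) / (mean * mean)

rng = random.Random(1)
for n in (2, 20, 200):
    print(n, "cells per sample -> CV^2 of sum:", round(pseudo_bulk_cv2(n, 500, rng), 3))
```

In this toy setup, going from 2 to 20 cells removes most of the technical noise, and going from 20 to 200 changes little because the remaining variability is biological.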

There are, of course, cases where you don't have enough cells so the three points above do not apply. However, in such cases, I would say that downsampling the number of cells involves throwing away so much information that the solution is almost as bad as the problem.

