Question

Downsample or not to downsample?

p.joshi ▴ 40

@pjoshi-22718

Germany

Hi,

I am trying to check whether down-sampling reads will affect my multi-dataset batch integration. I am using the scran, scater and batchelor packages in my pipeline.

1) First, I processed each dataset to get normalized values using cluster-based size factors.
2) Then I used multiBatchNorm to rescale the normalized values across datasets.
3) Finally, I performed fastMNN-based batch correction and used the corrected dimensions to plot the UMAP.

Based on a quick overview, I am getting the expected integration of cells from biological replicate batches. I haven't performed annotation yet, so I can't say for certain.

For marker analysis, I was going to use both findMarkers and edgeR-based approaches. As edgeR requires raw counts, I am worried that counts from deeply sequenced time points will affect the analysis. I also wanted to check how well down-sampling would reproduce the results of the data integration analysis.

I thought downsampleBatches would be an appropriate strategy for what I was planning to do. To do that, I first extracted the counts of each dataset and got a downsampled count matrix (hoping this is in proportion to the lowest-depth sample). Next, I tried calculating new size factors for these new count matrices, again following cluster-based size factor estimation. Each dataset gave a warning about negative size factor estimates in computeSumFactors, for which I found an explanation on the function's help page. However, for a few datasets, even quickCluster gives a negative size factor error and then fails to run.
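To make the idea of "downsampling in proportion to the lowest-depth sample" concrete, here is a minimal sketch in plain Python. It is not the downsampleBatches implementation: it swaps in simple binomial thinning (each molecule is kept independently with a fixed probability), and the function names, the toy batches and the use of the mean per-cell total as the measure of depth are all assumptions made for illustration.

```python
import random


def mean_depth(batch):
    """Average total count per cell; a batch is a list of cells, each a list of gene counts."""
    return sum(sum(cell) for cell in batch) / len(batch)


def thin_count(count, prop, rng):
    """Keep each of `count` molecules independently with probability `prop`."""
    return sum(1 for _ in range(count) if rng.random() < prop)


def downsample_to_shallowest(batches, seed=0):
    """Thin every batch so its expected depth matches the shallowest batch.

    Hypothetical helper for illustration only; binomial thinning is an
    assumption here, not the method used by any particular package.
    """
    rng = random.Random(seed)
    depths = {name: mean_depth(batch) for name, batch in batches.items()}
    target = min(depths.values())
    downsampled = {}
    for name, batch in batches.items():
        prop = target / depths[name]  # 1.0 for the shallowest batch
        downsampled[name] = [
            [thin_count(count, prop, rng) for count in cell] for cell in batch
        ]
    return downsampled


# Toy data: two cells by three genes per batch (made-up numbers).
batches = {
    "shallow": [[3, 0, 7], [2, 1, 5]],
    "deep": [[30, 4, 66], [25, 9, 58]],
}
down = downsample_to_shallowest(batches)
for name in batches:
    print(name, mean_depth(batches[name]), "->", mean_depth(down[name]))
```

The shallowest batch is left untouched, while deeper batches lose counts in proportion to how much deeper they are, which is why low-abundance genes in those batches can drop to zero after downsampling.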

So my question is, is it necessary to check the effect of down-sampling? The depth of my individual datasets ranges from ~30k to 350k reads per cell. Most of them are around 60-70k, but the outliers on the high end are 170k, 240k and 350k reads per cell. If I should, what would be a better strategy? I also checked downsampleReads, but it requires an HDF5 file, which my STARsolo run doesn't produce.

Could I get the cluster annotation from the full dataset analysis and use the downsampled data for edgeR-based differential gene expression analysis? That way I wouldn't have to normalize the downsampled data; however, I don't know if the analysis would stay similar. Hence, I just wanted some opinions about my strategy and about alternative approaches that could help me with downsampling, if I should do that for my dataset.

Thanks, Piyush

scrna-seq downsampleBatches scran

updated 5.7 years ago by Aaron Lun ★ 29k • written 5.7 years ago by p.joshi ▴ 40

score 5 · Accepted Answer · 2020-05-02

> So my question is, is it necessary to check effect of down-sampling?

I wouldn't bother. Downsampling has some uses, specifically in the definitive removal of library size-associated trends that are driven by differences in variance (and thus cannot be removed by simple scaling normalization). This might be helpful when you're trying to double-check relatively subtle effects, e.g., trajectories and such.
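As a rough illustration of a depth-associated difference that scaling cannot remove, the toy simulation below draws counts for a single gene with the same true abundance in a shallow and a 10-fold deeper batch. All numbers (500 cells, 200 vs 2000 molecules per cell, abundance 0.01) and function names are made up for this sketch, and independent sampling of molecules is an assumption. Dividing the deep counts by 10 matches the means, but the fraction of zeros (and hence the spread after a log-transformation) still differs; thinning the deep counts to the shallow depth matches both.

```python
import random


def simulate_counts(n_cells, depth, abundance, rng):
    """Counts for one gene: each of `depth` molecules per cell hits the gene with prob `abundance`."""
    return [
        sum(1 for _ in range(depth) if rng.random() < abundance)
        for _ in range(n_cells)
    ]


def thin(counts, prop, rng):
    """Downsample by keeping each molecule independently with probability `prop`."""
    return [sum(1 for _ in range(c) if rng.random() < prop) for c in counts]


def mean(values):
    return sum(values) / len(values)


def zero_fraction(values):
    return sum(1 for v in values if v == 0) / len(values)


rng = random.Random(42)
shallow = simulate_counts(500, 200, 0.01, rng)  # low-depth batch
deep = simulate_counts(500, 2000, 0.01, rng)    # 10x deeper batch

deep_scaled = [c / 10 for c in deep]  # scaling normalization to the shallow depth
deep_down = thin(deep, 0.1, rng)      # downsampling to the shallow depth

summary = {
    "shallow": (mean(shallow), zero_fraction(shallow)),
    "deep_scaled": (mean(deep_scaled), zero_fraction(deep_scaled)),
    "deep_downsampled": (mean(deep_down), zero_fraction(deep_down)),
}
for label, (avg, zeros) in summary.items():
    print(f"{label:>16}: mean={avg:.2f} zero_fraction={zeros:.3f}")
```

In this sketch the scaled deep batch has essentially no zeros while the shallow batch has many, even though their means agree; only the downsampled version resembles the shallow batch in both respects. Real data will be messier than this single-gene toy.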

However, it's difficult to be comfortable with the amount of information thrown away in the process if we were to do this routinely. For example, the negative size factor warnings from computeSumFactors are probably due to the fact that the downsampled counts are too low. (Note that the definition of "too low" depends on the variance; a 30k total UMI count would probably be fine.) Moreover, if your downsampling results disagree with your original results, you'll never know if that was because the latter was an artifact of differences in depth between batches or if the downsampling threw away resolution of genuine biology.

Now, if you see the same results before and after downsampling, you might feel a little bit more confident in the results. But only a little - differences in sequencing depth between batches are probably the least of your concerns. (To list a few - assumptions from MNN-based batch correction; clustering resolution and ambiguity; suboptimal feature/PC selection.) You could spend months fiddling with all (combinations of) those parameters and checking whether they have an effect on the results, but at some point, it is literally cheaper to just check the conclusions with an appropriately designed follow-up experiment.

> Could I get the cluster annotation from full dataset analysis and use the downsampled data for edgeR based differential gene expression analysis?

Sure, you could do that, but edgeR is even more robust to differences in library size than your average scRNA-seq analysis, given that the former actually uses a model to handle the mean-variance relationship of count data. So if you're willing to trust the results from the full dataset analysis, there's no reason to distrust the edgeR analysis.