Hi,

I am trying to check whether down-sampling reads will affect my multi-dataset batch integration. I am using the `scran`, `scater` and `batchelor` tools in my pipeline.

1) First, I processed each dataset to get normalized values based on clusters.
2) Then I use `multiBatchNorm` to rescale the normalized values across datasets.
3) Finally, I perform `fastMNN`-based batch correction and use the corrected dimensions to plot the UMAP.

Based on a quick overview, I am getting the expected integration of cells from biological replicate batches. I haven't performed annotation yet, so I can't say for certain.

To perform marker analysis, I was going to use both `findMarkers` and `edgeR`-based approaches. As `edgeR` requires raw counts, I am worried that counts from deeply sequenced time points will affect the analysis. I also wanted to check how well down-sampling would reproduce the results of the data integration analysis.

I thought `downsampleBatches` would be an appropriate strategy for what I was planning to do. To do that, I first extracted the counts of each dataset and got a downsampled count matrix (hoping this is in proportion to the lowest-depth sample). Next, I tried calculating new size factors for these new count matrices, again following cluster-based size factor estimation. Each dataset gave a warning about negative size factor estimates in `computeSumFactors`, for which I found an explanation on the function's help page. However, for a few datasets, even `quickCluster` gives a negative size factor error and then fails to run.
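To make the idea of matching depth across datasets concrete, here is a minimal base R sketch. It is a simplified stand-in that uses binomial thinning and is not the algorithm implemented in `downsampleBatches`; the two simulated batches, the `thin` helper and all variable names are assumptions made up for illustration.

```r
# Simplified stand-in for depth matching: binomial thinning of raw counts.
# This is NOT the algorithm used by downsampleBatches(); it only shows the idea.
set.seed(100)

# Two hypothetical batches (genes x cells) with very different depth.
n_genes <- 200
shallow <- matrix(rpois(n_genes * 50, lambda = 2),  nrow = n_genes)
deep    <- matrix(rpois(n_genes * 50, lambda = 20), nrow = n_genes)
batches <- list(shallow = shallow, deep = deep)

# Average library size (total counts per cell) in each batch.
avg_depth <- vapply(batches, function(m) mean(colSums(m)), numeric(1))

# Proportion of counts to keep so that every batch matches the shallowest one.
keep_prop <- min(avg_depth) / avg_depth

# Thin each entry independently: every count is kept with probability p.
thin <- function(m, p) {
  out <- rbinom(length(m), size = as.vector(m), prob = p)
  matrix(out, nrow = nrow(m), dimnames = dimnames(m))
}
downsampled <- Map(thin, batches, keep_prop)

# Average depth per batch after thinning; the batches should now be close.
new_depth <- vapply(downsampled, function(m) mean(colSums(m)), numeric(1))
print(round(rbind(before = avg_depth, after = new_depth)))
```

Checking `colSums` on the thinned matrices before normalization is a cheap way to spot cells whose totals have dropped very low.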

So my question is: is it necessary to check the effect of down-sampling? The depth of my individual datasets ranges from ~30k to 350k reads per cell. Most of them are around 60-70k, but the outliers at the high end are 170k, 240k and 350k. If I should check it, what would be a better strategy? I also looked at `downsampleReads`, but it requires an HDF5 file, which my STARsolo run doesn't produce.

Could I take the cluster annotation from the full dataset analysis and use the downsampled data for edgeR-based differential gene expression analysis? That way I wouldn't have to normalize the downsampled data; however, I don't know if the analysis would stay similar. Hence, I just wanted some opinions on my strategy and on alternative approaches that could help me with downsampling, if I should do that for my dataset.
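As a rough illustration of that idea, the base R sketch below sums downsampled counts into one pseudo-bulk column per cluster and sample, reusing labels from the full-depth analysis. Everything in it is hypothetical: `down_counts`, `cluster` and `sample_id` are simulated placeholders, and it assumes the cells are in the same order in the downsampled matrix as in the full analysis.

```r
# Hypothetical inputs: 'down_counts' stands in for a downsampled genes x cells
# matrix; 'cluster' and 'sample_id' stand in for per-cell labels taken from the
# full-depth analysis (same cell order assumed).
set.seed(42)
n_genes <- 100
n_cells <- 60
down_counts <- matrix(
  rpois(n_genes * n_cells, lambda = 3), nrow = n_genes,
  dimnames = list(paste0("gene", seq_len(n_genes)),
                  paste0("cell", seq_len(n_cells)))
)
cluster   <- rep(c("A", "B", "C"), each = 20)
sample_id <- rep(c("rep1", "rep2"), times = 30)

# One pseudo-bulk column per cluster/sample combination.
group <- paste(cluster, sample_id, sep = "_")

# rowsum() sums rows by group, so transpose to sum over cells, then transpose back.
pseudo_bulk <- t(rowsum(t(down_counts), group = group))

# Number of cells contributing to each pseudo-bulk column.
n_per_group <- table(group)
print(dim(pseudo_bulk))
print(n_per_group)
```

A genes-by-samples matrix of summed raw counts like `pseudo_bulk` is the kind of input a count-based tool such as `edgeR` works with, so per-cell size factors for the downsampled data would not be needed for this step.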

Thanks, Piyush

Thanks Aaron. I feel confident of the overall data integration with fastMNN as I see similar cell types across stages getting clustered together based on some marker expression. I just wanted an expert opinion on downsampling.

I also have another question regarding differential gene expression. In our meetings, other bioinformaticians have suggested comparing similar numbers of cells across clusters for DGE. So if the smallest cluster has 100 cells, the other clusters should be reduced to this number of cells, to explore the stability of DGE markers. One way would be to use bootstraps of cells from the bigger clusters and combine the results (taking the average of the LFC and p-value?). I didn't find their opinion incorrect, as in bulk RNA-seq DGE I have found it better to have balanced comparisons. Do you think this is a good approach for marker analysis in single-cell data? I haven't seen any tutorials on it, so I was wondering whether a difference in cell numbers is a big issue for marker analysis.

We explored this idea to some extent in the Biostatistics paper. I'll assume you are following a pseudo-bulk strategy, given that `edgeR` wouldn't be able to handle sample-level variability for per-cell counts.

tl;dr Don't worry about differences in cell number.

The main effect from such differences is that the sum from counts with more cells is more precise. In theory, this is not ideal because it means that different observations for the same gene would have different dispersions, whereas `edgeR` assumes that all observations have the same dispersion. In practice, this doesn't matter (much) for a variety of reasons:

- `edgeR` seems pretty robust to violations of its "same dispersion" assumption.

There are, of course, cases where you don't have enough cells so the three points above do not apply. However, in such cases, I would say that downsampling the number of cells involves throwing away so much information that the solution is almost as bad as the problem.
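The point that a sum over more cells is more precise can be seen in a small simulation. This is only a sketch under simplifying assumptions: counts for one gene are drawn as pure Poisson noise with a made-up per-cell mean, so it captures technical sampling only and ignores biological variability between samples.

```r
set.seed(1)
n_sim <- 2000
mean_per_cell <- 0.5  # hypothetical mean count per cell for one gene

# Pseudo-bulk sum for one gene across n_cells cells, repeated n_sim times.
sim_sums <- function(n_cells) {
  replicate(n_sim, sum(rpois(n_cells, lambda = mean_per_cell)))
}

# Coefficient of variation (sd / mean) as a measure of relative precision.
cv <- function(x) sd(x) / mean(x)

cv_small <- cv(sim_sums(100))
cv_large <- cv(sim_sums(1000))
print(c(cells_100 = cv_small, cells_1000 = cv_large))
```

Under this Poisson assumption the coefficient of variation scales as one over the square root of the expected total count, so the sum over 1000 cells is noticeably tighter than the sum over 100 cells.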