Question

multiBatchNorm vs. calculateSumFactors on concatenated dataset

Angelos Armen • 0

Hi,

The "standard" way to normalise across batches is to use scran::calculateSumFactors on each batch followed by a call to batchelor::multiBatchNorm. But how about concatenating the batches, calling scran::quickCluster with block specifying the batch, and finally calling calculateSumFactors on the concatenated dataset with the clusters from quickCluster? Using multiBatchNorm requires that fewer than 50% of the averaged genes are differential across pairs of batches, while the other strategy requires only the weaker assumption that fewer than 50% of the averaged genes are differential across pairs of clusters.
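For readers following along, the "standard" route described above might look roughly like the sketch below. The toy Poisson batches, the helper make_batch and the object names sce1 and sce2 are made up for illustration; only calculateSumFactors, quickCluster and multiBatchNorm come from the question itself.

```r
library(SingleCellExperiment)
library(scran)
library(batchelor)

set.seed(100)

# Hypothetical toy data: two batches sharing the same 500 genes,
# differing only in coverage ('depth'). Not real data.
mu <- rgamma(500, shape = 2, rate = 1)
make_batch <- function(ncells, depth) {
    counts <- matrix(rpois(500 * ncells, lambda = mu * depth), nrow = 500)
    rownames(counts) <- paste0("gene", seq_len(500))
    SingleCellExperiment(list(counts = counts))
}
sce1 <- make_batch(200, depth = 1)
sce2 <- make_batch(300, depth = 3)

# Step 1: deconvolution size factors, computed separately within each batch.
sizeFactors(sce1) <- calculateSumFactors(sce1, clusters = quickCluster(sce1))
sizeFactors(sce2) <- calculateSumFactors(sce2, clusters = quickCluster(sce2))

# Step 2: rescale the size factors so that batches are comparable,
# then compute log-normalized values (min.mean = 1 is the default).
normed <- multiBatchNorm(sce1, sce2, min.mean = 1)

# Size factors after rescaling, summarised per batch.
lapply(normed, function(x) summary(sizeFactors(x)))
```

The output is a list with one SingleCellExperiment per batch, ready to pass on to a merging step such as fastMNN.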

On a related note, the default value (1) of min.mean in multiBatchNorm is used in Orchestrating Single-Cell Analysis with Bioconductor for integrating UMI data, while calculateSumFactors uses min.mean = 0.1 by default for such data. Is there a reason for that? One reason I could think of is that the sparsest cluster is sparser than the sparsest batch, so calculateSumFactors needs a smaller value of min.mean. Using multiBatchNorm with min.mean = 1 seems indeed to give me better clustering results (after batchelor::fastMNN correction) than using min.mean = 0.1.

batchelor scran • 1.9k views

updated 3.9 years ago by Aaron Lun • written 3.9 years ago by Angelos Armen

score 2 · Accepted Answer · 2020-08-20

But how about concatenating the batches, calling scran::quickCluster with block specifying the batch, and finally calling calculateSumFactors on the concatenated dataset with the clusters from quickCluster?

Sure, you can do that if the batches are all from similar technologies. multiBatchNorm() was initially built to deal with the painful situation of merging Smart-seq2 and UMI-based data. For most applications, I suspect multiBatchNorm() is not really necessary, but it doesn't seem to hurt and so I just run it so that my merging pipelines will work regardless of what crazy datasets I throw in.
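As a rough sketch of that alternative (concatenate, cluster with block, then deconvolve once), assuming two toy batches from similar technologies; the simulated counts and the names sce1, sce2, combined and batch are purely illustrative:

```r
library(SingleCellExperiment)
library(scran)
library(scuttle)

set.seed(200)

# Hypothetical toy data: two batches with the same 500 genes.
mu <- rgamma(500, shape = 2, rate = 1)
make_batch <- function(ncells, depth) {
    counts <- matrix(rpois(500 * ncells, lambda = mu * depth), nrow = 500)
    rownames(counts) <- paste0("gene", seq_len(500))
    SingleCellExperiment(list(counts = counts))
}
sce1 <- make_batch(200, depth = 1)
sce2 <- make_batch(300, depth = 3)

# Concatenate the batches and record which batch each cell came from.
combined <- cbind(sce1, sce2)
batch <- rep(c("b1", "b2"), c(ncol(sce1), ncol(sce2)))

# Cluster with 'block' set to the batch. As far as I understand the
# documentation, clustering is then done within each batch, so every
# cluster contains cells from one batch only.
clusters <- quickCluster(combined, block = batch)
table(clusters, batch)

# One round of deconvolution across all clusters of the combined dataset.
sizeFactors(combined) <- calculateSumFactors(combined, clusters = clusters,
                                             min.mean = 0.1)
combined <- logNormCounts(combined)
```

If the cross-tabulation does show each cluster confined to a single batch, the between-cluster rescaling inside calculateSumFactors is what has to absorb the batch-to-batch difference.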

Mind you, I wouldn't be so sure that the assumption for calculateSumFactors() is weaker. It's true that we only require a non-DE majority between pairs of clusters, but if there's a strong batch effect, each batch will form separate clusters. This means you'll eventually be working with pairs of clusters from different batches, so the DEGs due to the batch effect will add onto any DEGs from the cluster-to-cluster differences. In contrast, multiBatchNorm() only cares about DEGs between the averages of the batches; so, if the cell type composition doesn't change between batches, then we only have to worry about the batch-induced DEGs.

In terms of the bigger picture, though, I don't think it matters all that much; these details are relatively minor compared to the heavy distortions to the data introduced by MNN correction and its contemporaries.

Is there a reason for that?

I must admit that I don't really remember all that much. If I had to say, we probably used a higher filter for multiBatchNorm() because we were potentially dealing with read count data + UMI count data, and I erred on the side of having a higher threshold to match the higher counts in the former. (At the same magnitude, read counts are noisier than UMI counts, hence this need for adjustment when filtering for informative genes.)
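To get a feel for what the two thresholds do, one can count how many genes survive each filter. The sketch below uses simulated sparse counts and scuttle::calculateAverage as a stand-in for the abundance measure; it is not meant to reproduce exactly how multiBatchNorm or calculateSumFactors apply min.mean internally.

```r
library(scuttle)

set.seed(300)

# Hypothetical sparse UMI-like counts: 1000 genes, 200 cells.
mu <- rgamma(1000, shape = 0.5, rate = 1)
counts <- matrix(rpois(1000 * 200, lambda = mu), nrow = 1000)

# Library size-adjusted average count per gene.
ave <- calculateAverage(counts)

# Number of genes retained at each candidate threshold.
thresholds <- c(0.1, 1)
n_pass <- vapply(thresholds, function(t) sum(ave >= t), integer(1))
names(n_pass) <- thresholds
n_pass
```

The higher threshold keeps a smaller set of more highly expressed genes, so the rescaling is estimated from fewer but less noisy averages.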

Using multiBatchNorm with min.mean = 1 seems indeed to give me better clustering results (after batchelor::fastMNN correction) than using min.mean = 0.1.

I don't really have any idea why this might be, so... ¯\_(ツ)_/¯

If you're curious, you can probably calculate the scaling applied to the size factors for each batch. As in, take the sizeFactors() before running multiBatchNorm(), and then use them to divide the size factors in the output objects. The ratio will be constant for all cells in each batch, but different across batches; I would be interested to know whether you see some notable differences for min.mean=1 versus min.mean=0.1.
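A minimal sketch of that check, again on made-up toy batches. To keep it short, library size factors stand in for the deconvolution size factors; the helper batch_scaling is hypothetical and only wraps the division described above.

```r
library(SingleCellExperiment)
library(scuttle)
library(batchelor)

set.seed(400)

# Hypothetical toy data: two batches with the same 500 genes.
mu <- rgamma(500, shape = 2, rate = 1)
make_batch <- function(ncells, depth) {
    counts <- matrix(rpois(500 * ncells, lambda = mu * depth), nrow = 500)
    rownames(counts) <- paste0("gene", seq_len(500))
    sce <- SingleCellExperiment(list(counts = counts))
    # Assumption: library size factors instead of calculateSumFactors().
    sizeFactors(sce) <- librarySizeFactors(sce)
    sce
}
sce1 <- make_batch(200, depth = 1)
sce2 <- make_batch(300, depth = 3)

# Size factors before multiBatchNorm().
before <- list(sizeFactors(sce1), sizeFactors(sce2))

# Per-batch scaling: size factors after multiBatchNorm() divided by
# those before. The ratio should be the same for every cell in a batch,
# so the median is reported as a single number per batch.
batch_scaling <- function(min.mean) {
    normed <- multiBatchNorm(sce1, sce2, min.mean = min.mean)
    vapply(seq_along(normed), function(i) {
        median(sizeFactors(normed[[i]]) / before[[i]])
    }, numeric(1))
}

scaling_1   <- batch_scaling(1)
scaling_0.1 <- batch_scaling(0.1)

# Compare the scaling of batch 2 relative to batch 1 under each threshold.
c(min.mean_1 = scaling_1[2] / scaling_1[1],
  min.mean_0.1 = scaling_0.1[2] / scaling_0.1[1])
```

A large difference between the two relative scalings would suggest that the choice of min.mean materially changes how the batches are brought onto the same scale.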