multiBatchNorm vs. calculateSumFactors on concatenated dataset
@angelos-armen-21507

Hi,

The "standard" way to normalise across batches is to run scran::calculateSumFactors on each batch followed by a call to batchelor::multiBatchNorm. But what about concatenating the batches, calling scran::quickCluster with block set to the batch, and finally calling calculateSumFactors on the concatenated dataset with the clusters from quickCluster? Using multiBatchNorm requires that fewer than 50% of the averaged genes are differential across pairs of batches, while the other strategy requires the weaker assumption that fewer than 50% of the averaged genes are differential across pairs of clusters.
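To make the comparison concrete, here is a minimal sketch of the two strategies, assuming a SingleCellExperiment `sce` with a `batch` column in its colData (the object and column names are illustrative, not from the original post):

```r
library(scran)
library(batchelor)

## Strategy 1 ("standard"): per-batch size factors, then rescale across
## batches so that systematic coverage differences are removed.
by.batch <- lapply(split(seq_len(ncol(sce)), sce$batch),
                   function(i) computeSumFactors(sce[, i]))
normed <- do.call(multiBatchNorm, by.batch)  # list of rescaled SCEs

## Strategy 2 (proposed): cluster within batches, then run a single
## joint calculateSumFactors() on the concatenated dataset.
clusters <- quickCluster(sce, block = sce$batch)  # batch-specific clusters
sizeFactors(sce) <- calculateSumFactors(sce, clusters = clusters)
sce <- scater::logNormCounts(sce)
```

In Strategy 2 the blocking in quickCluster ensures no cluster spans two batches, so the deconvolution within each cluster is unaffected by the batch effect; the between-cluster rescaling step is where the non-DE assumption comes in.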

On a related note, the default value (1) of min.mean in multiBatchNorm is used in Orchestrating Single-Cell Analysis with Bioconductor for integrating UMI data, while calculateSumFactors defaults to min.mean = 0.1 for such data. Is there a reason for that? One reason I can think of is that the sparsest cluster is sparser than the sparsest batch, so a smaller value of min.mean is needed. Using multiBatchNorm with min.mean = 1 does indeed seem to give me better clustering results (after batchelor::fastMNN correction) than using min.mean = 0.1.

batchelor scran
Aaron Lun ★ 27k
@alun

But how about concatenating the batches, calling scran::quickCluster with block specifying the batch, and finally calling calculateSumFactors on the concatenated dataset with the clusters from quickCluster?

Sure, you can do that if the batches are all from similar technologies. multiBatchNorm() was initially built to deal with the painful situation of merging Smart-seq2 and UMI-based data. For most applications, I suspect multiBatchNorm() is not really necessary, but it doesn't seem to hurt and so I just run it so that my merging pipelines will work regardless of what crazy datasets I throw in.

Mind you, I wouldn't be so sure that the assumption for calculateSumFactors() is weaker. It's true that we only require a non-DE majority between pairs of clusters, but if there's a strong batch effect, each batch will form separate clusters. This means you'll eventually be working with pairs of clusters from different batches, so the DEGs due to the batch effect will add onto any DEGs from the cluster-to-cluster differences. In contrast, multiBatchNorm() only cares about DEGs between the averages of the batches; so, if the cell type composition doesn't change between batches, then we only have to worry about the batch-induced DEGs.

In terms of the bigger picture, though, I don't think it matters all that much; these details are relatively minor compared to the heavy distortions to the data introduced by MNN correction and its contemporaries.

Is there a reason for that?

I must admit that I don't really remember all that much. If I had to say, we probably used a higher filter for multiBatchNorm() because we were potentially dealing with read count data + UMI count data, and I erred on the side of having a higher threshold to match the higher counts in the former. (At the same magnitude, read counts are noisier than UMI counts, hence this need for adjustment when filtering for informative genes.)

Using multiBatchNorm with min.mean = 1 seems indeed to give me better clustering results (after batchelor::fastMNN correction) than using min.mean = 0.1.

I don't really have any idea of why this might be, so... ¯\_(ツ)_/¯

If you're curious, you can probably calculate the scaling applied to the size factors for each batch. As in, take the sizeFactors() before running multiBatchNorm(), and then use them to divide the size factors in the output objects. The ratio will be constant for all cells in each batch, but different across batches; I would be interested to know whether you see some notable differences for min.mean=1 versus min.mean=0.1.
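A sketch of that calculation, assuming a single SingleCellExperiment `sce` with size factors already set by calculateSumFactors() and a `batch` column in its colData (names are illustrative):

```r
library(batchelor)

before <- sizeFactors(sce)  # size factors from calculateSumFactors()

## multiBatchNorm() rescales the size factors so that coverage is
## comparable across batches; the per-cell ratio recovers that scaling.
normed <- multiBatchNorm(sce, batch = sce$batch, min.mean = 1)
after <- sizeFactors(normed)

scaling <- after / before            # constant within each batch
tapply(scaling, sce$batch, median)   # one scaling value per batch
```

Repeating this with min.mean = 0.1 and comparing the per-batch values would show how sensitive the rescaling is to the filter threshold.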

Thank you Aaron.

Mind you, I wouldn't be so sure that the assumption for calculateSumFactors() is weaker. It's true that we only require a non-DE majority between pairs of clusters, but if there's a strong batch effect, each batch will form separate clusters. This means you'll eventually be working with pairs of clusters from different batches, so the DEGs due to the batch effect will add onto any DEGs from the cluster-to-cluster differences. In contrast, multiBatchNorm() only cares about DEGs between the averages of the batches; so, if the cell type composition doesn't change between batches, then we only have to worry about the batch-induced DEGs.

Indeed, but the composition does change between my batches. Also, I proposed using quickCluster with block equal to the batch, so the clusters will be batch-specific by design.

I would be interested to know whether you see some notable differences for min.mean=1 versus min.mean=0.1.

Here you go:

min.mean = 1: 1.562508 5.834937 1.708959 2.902615 1.426514 4.478574 1.000000 3.274636 1.144713 3.347561

min.mean = 0.1: 1.630960 4.565926 1.676077 2.556120 1.340033 3.366067 1.000000 2.709936 1.203517 2.916712

After lots of experimentation, I got the best clustering results after downsampling the batches using DropletUtils::downsampleBatches and applying the concatenation strategy for normalisation.
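A sketch of that pipeline, assuming a SingleCellExperiment `sce` with a `batch` column in its colData (names are illustrative; check the DropletUtils documentation for the exact downsampleBatches interface in your version):

```r
library(DropletUtils)
library(scran)

## Downsample counts so that average coverage is equal across batches,
## then apply the joint normalisation strategy from the question.
counts(sce) <- downsampleBatches(counts(sce), batch = sce$batch)
clusters <- quickCluster(sce, block = sce$batch)
sizeFactors(sce) <- calculateSumFactors(sce, clusters = clusters)
sce <- scater::logNormCounts(sce)
```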

Yes, downsampling is the theoretically safest approach in terms of equalizing coverage without distorting the mean-variance relationship. I'm reluctant to recommend it as the default because it throws away so much information, but I am not too surprised that it does well. I guess you've already read the relevant section of the OSCA book, but here's another one that may be of interest.