But how about concatenating the batches, calling scran::quickCluster with block specifying the batch, and finally calling calculateSumFactors on the concatenated dataset with the clusters from quickCluster?
Sure, you can do that if the batches are all from similar technologies.
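For reference, the workflow described in the question might look something like this (a sketch; `sce1` and `sce2` are assumed to be `SingleCellExperiment` objects with the same genes in the same order, and the object names are illustrative):

```r
library(scran)

# Concatenate the batches column-wise.
combined <- cbind(sce1, sce2)
batch <- rep(c("batch1", "batch2"), c(ncol(sce1), ncol(sce2)))

# Cluster within each batch separately, so that the clustering
# is not driven by the batch effect itself.
clust <- quickCluster(combined, block=batch)

# Deconvolution-based size factors using the blocked clusters.
sizeFactors(combined) <- calculateSumFactors(combined, clusters=clust)
```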
multiBatchNorm() was initially built to deal with the painful situation of merging Smart-seq2 and UMI-based data. For most applications, I suspect
multiBatchNorm() is not really necessary, but it doesn't seem to hurt and so I just run it so that my merging pipelines will work regardless of what crazy datasets I throw in.
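To illustrate the placement in such a pipeline, a minimal sketch (again assuming `sce1` and `sce2` with log-normalizable counts; the downstream `fastMNN()` call is just one example of a consumer):

```r
library(batchelor)

# Rescale size factors across batches so that the log-normalized
# values are comparable, e.g., when mixing Smart-seq2 and UMI data.
normed <- multiBatchNorm(sce1, sce2)

# Downstream merging then operates on the rescaled logcounts.
merged <- fastMNN(normed[[1]], normed[[2]])
```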
Mind you, I wouldn't be so sure that the assumption for
calculateSumFactors() is weaker. It's true that we only require a non-DE majority between pairs of clusters, but if there's a strong batch effect, each batch will form separate clusters. This means you'll eventually be working with pairs of clusters from different batches, so the DEGs due to the batch effect will add onto any DEGs from the cluster-to-cluster differences. In contrast,
multiBatchNorm() only cares about DEGs between the averages of the batches; so, if the cell type composition doesn't change between batches, then we only have to worry about the batch-induced DEGs.
In terms of the bigger picture, though, I don't think it matters all that much; these details are relatively minor compared to the heavy distortions to the data introduced by MNN correction and its contemporaries.
Is there a reason for that?
I must admit that I don't really remember all that much. If I had to say, we probably used a higher filter for
multiBatchNorm() because we were potentially dealing with read count data + UMI count data, and I erred on the side of having a higher threshold to match the higher counts in the former. (At the same magnitude, read counts are noisier than UMI counts, hence this need for adjustment when filtering for informative genes.)
Using multiBatchNorm with min.mean = 1 does indeed seem to give me better clustering results (after batchelor::fastMNN correction) than using min.mean = 0.1.
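For anyone wanting to reproduce that comparison, something along these lines should work (a sketch; `sce1`/`sce2` and the clustering step are left as placeholders):

```r
library(batchelor)

# Same pipeline, two different abundance filters for the rescaling.
normed.hi <- multiBatchNorm(sce1, sce2, min.mean=1)
normed.lo <- multiBatchNorm(sce1, sce2, min.mean=0.1)

merged.hi <- fastMNN(normed.hi[[1]], normed.hi[[2]])
merged.lo <- fastMNN(normed.lo[[1]], normed.lo[[2]])
# ... then cluster each 'merged' object and compare the results.
```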
I don't really have any idea of why this might be, so... ¯\_(ツ)_/¯
If you're curious, you can probably calculate the scaling applied to the size factors for each batch. As in, take the
sizeFactors() before running
multiBatchNorm(), and then use them to divide the size factors in the output objects. The ratio will be constant for all cells in each batch, but different across batches; I would be interested to know whether you see some notable differences for
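That check might look like the following (a sketch; it assumes size factors are already present on `sce1` and `sce2` before rescaling):

```r
library(batchelor)

before1 <- sizeFactors(sce1)
before2 <- sizeFactors(sce2)

normed <- multiBatchNorm(sce1, sce2)

# Ratio of new to old size factors; this should be constant for all
# cells within each batch, but differ between the two batches.
ratio1 <- sizeFactors(normed[[1]]) / before1
ratio2 <- sizeFactors(normed[[2]]) / before2
head(ratio1); head(ratio2)
```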