Cambridge, United Kingdom

Before we start: the input should be scale-normalized (e.g., using library sizes or, even better, size factors) before it is supplied to `mnnCorrect`. Maybe the documentation for `mnnCorrect` could have been clearer in this respect, but if you're using the *scater*/*scran* framework, you should have already scale-normalized via `normalize` by the time you get log-expression values. More generally, I've never encountered a log-transformed expression matrix that hasn't been scale-normalized in some manner (e.g., log-CPMs, log-TPMs).
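To make "scale-normalized" concrete, here is a minimal sketch of library-size normalization followed by a log-transformation. It is written in plain Python purely for illustration; `log_normalize` and `pseudo_count` are hypothetical names, not part of *scater* or *scran*, and proper size factors would normally replace the simple library-size ratio used here.

```python
import math

def log_normalize(counts, pseudo_count=1.0):
    """counts: list of cells, each a list of per-gene counts.
    Returns log2(count / size_factor + pseudo_count) for every cell."""
    lib_sizes = [sum(cell) for cell in counts]
    mean_lib = sum(lib_sizes) / len(lib_sizes)
    # Library-size factors, centred so that they average to 1 across cells.
    size_factors = [lib / mean_lib for lib in lib_sizes]
    return [
        [math.log2(c / sf + pseudo_count) for c in cell]
        for cell, sf in zip(counts, size_factors)
    ]

# Two cells with the same composition but a two-fold difference in depth.
counts = [[10, 0, 30], [20, 0, 60]]
logexpr = log_normalize(counts)
```

After scaling, the two cells have the same log-expression profile, which is the depth-related difference that should be gone before batch correction.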

Now, onto your question. `mnnCorrect` performs cosine normalization internally to adjust for differences in the size of the counts between batches (e.g., due to sequencing depth or capture efficiency or what have you). By default, the corrected values returned by the function are also on the cosine scale. These values cannot be interpreted on the scale of the original log-expression values, even if you run `exp` on them.
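A small sketch of why the cosine scale is not invertible with `exp`: cosine normalization divides each cell's expression vector by its own L2 norm, so the per-cell magnitude is thrown away. This is plain Python for illustration only; `cosine_normalize` is a hypothetical helper, not the code inside `mnnCorrect`.

```python
import math

def cosine_normalize(cell):
    """Scale one cell's log-expression vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in cell))
    # An all-zero cell has no direction; return it unchanged.
    return [x / norm for x in cell] if norm > 0 else list(cell)

cell_a = [1.0, 2.0, 2.0]  # L2 norm of 3
cell_b = [2.0, 4.0, 4.0]  # same direction, L2 norm of 6

norm_a = cosine_normalize(cell_a)
norm_b = cosine_normalize(cell_b)

# Exponentiating the cosine-normalized values does not give back exp(cell_a),
# because every cell was divided by a different norm before this step.
exp_of_norm_a = [math.exp(x) for x in norm_a]
exp_of_original_a = [math.exp(x) for x in cell_a]
```

Both cells end up as the same unit vector, so nothing in the output records that one had twice the magnitude of the other.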

You *could* ask the function to return corrected values on the original scale by setting `cos.norm.out=FALSE`, and then run `exp` on that. However, I would not advise using these values in any downstream statistical model that relies on raw counts. It's for much the same reason that we don't scale the counts directly in *edgeR* during normalization; doing so would distort the mean-variance relationship, making it possible to get sub-Poisson variation. This generally should *not* be possible, even with UMIs, due to low capture efficiencies. I doubt that `mnnCorrect` (or really, any correction method) will preserve the mean-variance relationship in the count data.
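The mean-variance point can be checked with a quick simulation. For a Poisson count with mean λ, multiplying by a factor s gives a mean of sλ but a variance of s²λ, so any s below 1 pushes the variance under the mean. The sketch below assumes purely Poisson counts with λ = 5 and a scaling factor of 0.5, both chosen arbitrarily for illustration; it says nothing about what *edgeR* or `mnnCorrect` do internally.

```python
import math
import random

def poisson_sample(lam, rng):
    """Draw one Poisson variate with Knuth's method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def mean_and_variance(values):
    """Sample mean and unbiased sample variance."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, var

rng = random.Random(0)  # fixed seed so the run is reproducible
raw = [poisson_sample(5.0, rng) for _ in range(20000)]
scaled = [0.5 * x for x in raw]  # "normalizing" by scaling the counts directly

raw_mean, raw_var = mean_and_variance(raw)
scaled_mean, scaled_var = mean_and_variance(scaled)

print("raw    variance/mean:", raw_var / raw_mean)
print("scaled variance/mean:", scaled_var / scaled_mean)
```

The raw counts have a variance-to-mean ratio near 1, while the scaled values sit near 0.5, which a count-based model would read as sub-Poisson variation.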

Perhaps some consideration of the bigger picture may be helpful. By the time you get output from `mnnCorrect`, you should have already performed two rounds of cell-specific scaling normalization (size factors, cosine normalization) and the gene-specific batch correction itself. Do you really need more normalization on top of that? What biases are you encountering that lead you to think that more work is necessary? We use the corrected output from `mnnCorrect` directly in clustering, and that seems to work quite well.

written 6 months ago by Aaron Lun, modified 6 months ago