Before we start: the inputs should be scale-normalized before you supply them to `mnnCorrect` (e.g., using library sizes or, even better, size factors). Maybe the documentation for `mnnCorrect` could have been clearer in this respect, but if you're using the scater/scran framework, you should have already scale-normalized via `normalize` by the time you get log-expression values. More generally, I've never encountered a log-transformed expression matrix that hasn't been scale-normalized in some manner (e.g., log-CPMs, log-TPMs).
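To make "scale-normalized" concrete, here is a minimal sketch of library-size normalization followed by a log-transformation, on made-up counts. It is only an illustration in plain Python: the `log_cpm` helper and its `pseudo_count` argument are hypothetical names, and this is not how scater, scran or edgeR compute their values.

```python
import math

# Toy count matrix (made-up numbers): rows are genes, columns are cells.
# The second cell has the same profile as the first at 4x the depth.
counts = [
    [10, 40],
    [30, 120],
    [60, 240],
]

def log_cpm(counts, pseudo_count=1.0):
    """Scale each cell by its library size (counts per million), then log2."""
    n_cells = len(counts[0])
    lib_sizes = [sum(row[j] for row in counts) for j in range(n_cells)]
    return [
        [math.log2(row[j] / lib_sizes[j] * 1e6 + pseudo_count)
         for j in range(n_cells)]
        for row in counts
    ]

normalized = log_cpm(counts)

# After scaling, the depth difference between the two cells is gone.
for gene_values in normalized:
    print(gene_values)
```

Size factors play the same role as the library sizes here; they are just a more robust estimate of the per-cell scaling.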
Now, onto your question. `mnnCorrect` performs cosine normalization internally to adjust for differences in the size of the counts between batches (e.g., due to sequencing depth or capture efficiency or what have you). By default, the corrected values returned by the function are also on the cosine scale. These cannot be interpreted on the scale of the original log-expression values, even if you run `exp` on them.
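A small sketch of why the cosine scale does not map back to the original one, assuming cosine normalization means dividing each cell's log-expression vector by its L2 norm. The `cosine_normalize` helper and the numbers are made up for illustration; this is not the code inside `mnnCorrect`.

```python
import math

def cosine_normalize(cell):
    """Divide one cell's log-expression vector by its L2 norm."""
    norm = math.sqrt(sum(x * x for x in cell))
    return [x / norm for x in cell]

# Hypothetical log-expression values for one cell (four genes).
log_expr = [3.0, 4.0, 0.0, 12.0]

cos_expr = cosine_normalize(log_expr)

# The normalized vector has unit length, so the cell's overall
# magnitude has been divided out.
unit_length = math.sqrt(sum(x * x for x in cos_expr))

# Exponentiating the cosine-scale values does not recover the
# exponentiated original values.
back_transformed = [math.exp(x) for x in cos_expr]
original_scale = [math.exp(x) for x in log_expr]

print(cos_expr)
print(back_transformed)
print(original_scale)
```

For non-negative log-expression values, every cosine-normalized entry lies between 0 and 1, so `exp` of it is squeezed between 1 and e no matter how large the original value was.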
You could ask the function to return corrected values on the original scale by setting `cos.norm.out=FALSE`, and run `exp` on that. However, I would not advise using these values in any downstream statistical model relying on raw counts. It's for much the same reason that we don't scale the counts directly in edgeR during normalization; doing so would distort the mean-variance relationship, making it possible to get sub-Poisson variation. This generally should not be possible, even with UMIs, due to low capture efficiencies. I doubt that `mnnCorrect` (or really, any correction method) will preserve the mean-variance relationship in the count data.
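The sub-Poisson point can be checked with a toy simulation: multiplying counts by a factor a scales the mean by a but the variance by a squared, so the variance-to-mean ratio is multiplied by a. The sketch below uses made-up parameters (a Poisson mean of 20, a scaling factor of 0.5) and a hand-rolled `sample_poisson` helper to stay within the standard library; it says nothing about what `mnnCorrect` itself does to the counts.

```python
import math
import random
import statistics

def sample_poisson(rng, lam):
    """Draw one Poisson variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def dispersion_index(values):
    """Variance-to-mean ratio; about 1 for Poisson-distributed counts."""
    return statistics.variance(values) / statistics.mean(values)

rng = random.Random(0)  # fixed seed so the run is reproducible
counts = [sample_poisson(rng, 20.0) for _ in range(5000)]

# "Normalize" by scaling the counts directly (assumed factor of 0.5).
scaled = [c * 0.5 for c in counts]

raw_index = dispersion_index(counts)
scaled_index = dispersion_index(scaled)

print(raw_index, scaled_index)
```

The scaled values have a variance below their mean, which a count model such as the Poisson or negative binomial cannot represent.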
Perhaps some consideration of the bigger picture may be helpful. By the time you get output from `mnnCorrect`, you should have already performed two rounds of cell-specific scaling normalization (size factors, cosine normalization) and the gene-specific batch correction itself. Do you really need more normalization on top of that? What biases are you encountering that lead you to think that more work is necessary? We use the corrected output from `mnnCorrect` directly in clustering, and that seems to work quite well.