9 months ago by
Cambridge, United Kingdom
Before we start: the inputs should be scale-normalized before you supply them to
mnnCorrect (e.g., using library sizes or, even better, size factors). Maybe the documentation for
mnnCorrect could have been clearer in this respect, but if you're using the scater/scran framework, you should have already scale-normalized via
normalize by the time you get log-expression values. More generally, I've never encountered a log-transformed expression matrix that hasn't been scale-normalized in some manner (e.g., log-CPMs, log-TPMs).
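To make "scale-normalized" concrete, here is a minimal sketch in plain Python of library-size normalization followed by a log-transformation. It is only an illustration of the idea, not what scater or scran actually run; the toy count matrix, the function names and the pseudo-count of 1 are all assumptions made for this example.

```python
import math

# Toy counts: rows are genes, columns are cells (hypothetical values).
counts = [
    [10, 20, 5],
    [0, 4, 1],
    [30, 56, 14],
]

def library_size_factors(counts):
    """Library size of each cell, scaled so the factors average to 1."""
    n_cells = len(counts[0])
    lib_sizes = [sum(row[j] for row in counts) for j in range(n_cells)]
    mean_size = sum(lib_sizes) / n_cells
    return [size / mean_size for size in lib_sizes]

def log_normalize(counts, size_factors, pseudo_count=1.0):
    """log2(count / size factor + pseudo-count) for every gene and cell."""
    return [
        [math.log2(c / sf + pseudo_count) for c, sf in zip(row, size_factors)]
        for row in counts
    ]

size_factors = library_size_factors(counts)
logexprs = log_normalize(counts, size_factors)
```

In this toy matrix the first gene has counts proportional to library size, so its log-expression values come out identical across the three cells once the size factors are applied.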
Now, onto your question.
mnnCorrect performs cosine normalization internally to adjust for differences in the size of the counts between batches (e.g., due to sequencing depth or capture efficiency or what have you). By default, the corrected values returned by the function are also on the cosine scale. These values cannot be interpreted on the scale of the original log-expression values, even if you run
exp on the output values.
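A small sketch may help show why exponentiating does not get you back to the original scale. It assumes that cosine normalization amounts to dividing each cell's log-expression vector by its L2 norm; the function name and the toy values are hypothetical, and this is not the package's own code.

```python
import math

def cosine_normalize(cell):
    """Divide one cell's log-expression vector by its L2 norm."""
    norm = math.sqrt(sum(x * x for x in cell))
    return [x / norm for x in cell]

# Hypothetical log-expression values for one cell across four genes.
cell = [3.0, 0.0, 4.0, 12.0]
cos_cell = cosine_normalize(cell)

# The normalized vector has unit length, whatever the cell's original scale.
unit_length = math.sqrt(sum(x * x for x in cos_cell))

# Exponentiating does not undo the division by the norm:
# exp(x / norm) equals exp(x) ** (1 / norm), not exp(x).
back = [math.exp(x) for x in cos_cell]
```

Because the norm differs from cell to cell, each cell's values are raised to a different power after exponentiation, so the result is not a rescaled count.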
You could ask the function to return corrected values on the original scale by setting
cos.norm.out=FALSE, and then run
exp on that. However, I would not advise using these values in any downstream statistical model relying on raw counts. It's for much the same reason that we don't scale the counts directly in edgeR during normalization; doing so would distort the mean-variance relationship, making it possible to get sub-Poisson variation. Sub-Poisson variation should generally not be possible, even with UMIs, due to low capture efficiencies. I doubt that
mnnCorrect (or really, any correction method) will preserve the mean-variance relationship in the count data.
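The point about sub-Poisson variation can be seen with a quick simulation. This is a sketch under simple assumptions: counts drawn from a Poisson distribution with mean 10 (using Knuth's sampler, since the Python standard library has none), and a hypothetical scaling factor of 0.5 applied directly to the counts. It has nothing to do with how edgeR or mnnCorrect work internally.

```python
import math
import random

def poisson(lam, rng):
    """Draw one Poisson variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k

def mean_and_variance(values):
    """Sample mean and unbiased sample variance."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, var

rng = random.Random(0)
counts = [poisson(10.0, rng) for _ in range(5000)]
scale = 0.5  # hypothetical scaling factor applied directly to the counts
scaled = [scale * c for c in counts]

mean_raw, var_raw = mean_and_variance(counts)
mean_scaled, var_scaled = mean_and_variance(scaled)

# Poisson counts have variance roughly equal to the mean (ratio near 1).
ratio_raw = var_raw / mean_raw
# Scaling multiplies the mean by `scale` but the variance by `scale ** 2`,
# so the ratio shrinks by a factor of `scale` and falls below 1.
ratio_scaled = var_scaled / mean_scaled
```

A count model would read the scaled values as having less variance than Poisson sampling allows. A scaling factor above 1 would distort the relationship in the other direction, inflating the apparent overdispersion.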
Perhaps some consideration of the bigger picture may be helpful. By the time you get output from
mnnCorrect, you should have already performed two rounds of cell-specific scaling normalization (size factors, cosine normalization) and the gene-specific batch correction itself. Do you really need more normalization on top of that? What biases are you encountering that lead you to think that more work is necessary? We use the corrected output from
mnnCorrect directly in clustering, and that seems to work quite well.
modified 9 months ago
9 months ago by
Aaron Lun • 21k