Question

Downstream analysis after using mnnCorrect()

Dave Tang ▴ 210

@dave-tang-4661

Australia/Perth/UWA

Hello,

I have several batches of scRNA-seq data, which are biological replicates, and would like to use mnnCorrect() from the scran package to correct for batch effects. Each batch should have quite a different cell type composition, which is why I would like to use the MNN method.

The help page says: "The input expression values should generally be log-transformed." This is fine, but my question is about the downstream analysis of the corrected values. In particular, I would like to use the Seurat pipeline, but since Seurat also carries out a log-normalisation step, is it OK for me to simply use exp() on the corrected MNN values and feed them into Seurat as "raw" counts?

Thank you,

Dave

scran mnncorrect • 3.6k views

updated 7.4 years ago by Aaron Lun ★ 28k • written 7.4 years ago by Dave Tang ▴ 210

score 6 · Accepted Answer · 2018-02-14

Before we start: the input should be scale-normalized before it is supplied to mnnCorrect (e.g., using library sizes or, even better, size factors). Maybe the documentation for mnnCorrect could have been clearer in this respect - but if you're using the scater/scran framework, you should have already scale-normalized via normalize by the time you get log-expression values. More generally, I've never encountered a log-transformed expression matrix that hasn't been scale-normalized in some manner (e.g., log-CPMs, log-TPMs).
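
For readers outside the scater/scran framework, here is a minimal base-R sketch of what "scale-normalize, then log-transform" means when using library sizes. The toy matrix, the helper name log_norm and the pseudo-count of 1 are assumptions made for illustration; this is not a copy of what normalize does internally.

```r
# Toy count matrix: 4 genes (rows) x 3 cells (columns). Values are made up.
# cell2 is exactly cell1 sequenced twice as deeply.
counts <- matrix(c(10, 0,  5, 20,
                   20, 0, 10, 40,
                    5, 1,  2,  9),
                 nrow = 4,
                 dimnames = list(paste0("gene", 1:4), paste0("cell", 1:3)))

# Hypothetical helper: library-size factors centred at 1, then log2(x + pseudo).
log_norm <- function(counts, pseudo = 1) {
  lib <- colSums(counts)                # library size of each cell
  sf <- lib / mean(lib)                 # size factors with mean 1
  scaled <- sweep(counts, 2, sf, "/")   # divide each cell (column) by its size factor
  log2(scaled + pseudo)
}

logexprs <- log_norm(counts)
logexprs
```

After scaling, cell1 and cell2 have identical log-expression profiles, which is the point of removing depth differences before any batch correction.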

Now, onto your question. mnnCorrect performs cosine normalization internally to adjust for differences in the size of the counts between batches (e.g., due to sequencing depth or capture efficiency or what have you). By default, the corrected values returned by the function are also on the cosine scale. These values cannot be interpreted on the scale of the original log-expression values, so running exp on them will not give you anything resembling counts.
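
To illustrate why the cosine scale is not the log-expression scale, here is a minimal base-R sketch that takes cosine normalization to mean dividing each cell's expression vector by its L2 norm. The helper cosine_norm and the toy values are hypothetical and only show the idea; they are not the function's internal code.

```r
# Hypothetical helper: scale each cell (column) to unit L2 norm.
cosine_norm <- function(x) {
  l2 <- sqrt(colSums(x^2))
  sweep(x, 2, l2, "/")
}

# Toy log-expression values: 3 genes x 2 cells (made-up numbers).
# The second cell is the first with every value doubled.
logexprs <- matrix(c(3, 4, 0,
                     6, 8, 0), nrow = 3)

cosnormed <- cosine_norm(logexprs)
colSums(cosnormed^2)   # every cell now has unit length

# The per-cell norm has been divided out and is not kept, so exp() of the
# cosine-scaled values is not on the same scale as exp() of the originals.
exp(cosnormed[, 1])
exp(logexprs[, 1])
```

Both cells end up with the same cosine-normalized profile even though their log-expression values differ by a factor of two, so the original magnitudes cannot be recovered from the output alone.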

You could ask the function to return corrected values on the original scale by setting cos.norm.out=FALSE, and then run exp on the result. However, I would not advise using these values in any downstream statistical model relying on raw counts. It's for much the same reason that we don't scale the counts directly in edgeR during normalization; doing so would distort the mean-variance relationship, making it possible to get sub-Poisson variation. This generally should not be possible, even with UMIs, due to low capture efficiencies. I doubt that mnnCorrect (or really, any correction method) will preserve the mean-variance relationship in the count data.
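
The sub-Poisson point can be seen with a small simulation. This is only a sketch on simulated Poisson counts with an arbitrary scaling factor of 0.5, not real data and not anything edgeR or mnnCorrect does.

```r
set.seed(1)
counts <- rpois(1e5, lambda = 10)   # Poisson counts: variance is about equal to the mean
scaled <- counts * 0.5              # direct scaling of the counts by an arbitrary factor

c(mean = mean(counts), var = var(counts))   # mean and variance both near 10
c(mean = mean(scaled), var = var(scaled))   # mean near 5, variance near 2.5

# Scaling by a factor s multiplies the mean by s but the variance by s^2,
# so the variance-to-mean ratio drops below the Poisson value of 1.
var(scaled) / mean(scaled)
```

A count-based model that assumes at least Poisson variation would be fed values that violate that assumption, which is the distortion described above.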

Perhaps some consideration of the bigger picture may be helpful. By the time you get output from mnnCorrect, you should have already performed two rounds of cell-specific scaling normalization (size factors, cosine normalization) and the gene-specific batch correction itself. Do you really need more normalization on top of that? What biases are you encountering that lead you to think that more work is necessary? We use the corrected output from mnnCorrect directly in clustering, and that seems to work quite well.
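
As a rough sketch of what clustering directly on the corrected values could look like, the example below uses simulated stand-ins for the per-batch corrected matrices (genes in rows, cells in columns). The list name corrected, the helper make_batch, and the choice of Ward hierarchical clustering with k = 2 are all assumptions for illustration; substitute the real corrected matrices and whatever clustering method you prefer.

```r
set.seed(42)

# Hypothetical helper: simulate one batch with two populations that differ
# by a shift in expression. n1 and n2 set the composition of the batch.
make_batch <- function(n1, n2, ngenes = 50) {
  cbind(matrix(rnorm(ngenes * n1, mean = 0), nrow = ngenes),
        matrix(rnorm(ngenes * n2, mean = 3), nrow = ngenes))
}

# Stand-in for per-batch corrected matrices with different compositions.
corrected <- list(make_batch(20, 5), make_batch(5, 20))

# Combine the batches and cluster the cells on the corrected values as they are,
# with no back-transformation to counts.
combined <- do.call(cbind, corrected)                     # genes x all cells
tree <- hclust(dist(t(combined)), method = "ward.D2")     # cells as observations
clusters <- cutree(tree, k = 2)

# Cross-tabulate cluster against batch of origin.
table(cluster = clusters, batch = rep(1:2, each = 25))
```

If the correction has worked, cells from both batches should appear in each cluster in proportions that reflect the composition of each batch, rather than the clusters splitting by batch.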