Question

Batch Corrections (MNN) with TPM values

hamza_karakurt ▴ 60

Hello, I have a simple question. I have two public data sets and I want to use both of them with MNN correction for certain analyses. One data set provides raw counts, which are suitable for scater/scran, but the other only provides TPM values in its supplementary material. As far as I know, MNN requires normalized counts. If I convert the raw counts of the first data set with the calculateTPM() function and use the TPM values of both data sets for MNN, do you think it will work?

Thank you in advance.

scRNA-Seq scater scran rna-seq MNN • 2.1k views

updated 4.8 years ago by Aaron Lun ★ 28k • written 4.8 years ago by hamza_karakurt ▴ 60

score 2 · Answer 1 · 2019-10-14

For starters, you'll need to log-transform the values. The problem lies in the fact that the usual pseudo-count of 1 makes no sense for TPMs. (One could argue that it doesn't make sense in general, but it is especially nonsensical in this case, where you're adding a "count" to a non-count TPM.) I discuss this in more detail here.

Without the counts, the next-best solution is to guess the average per-cell sequencing depth for the second experiment. For example, if you assume that each cell was sequenced to a depth of 5000, then you could recover some normalized count-like values by multiplying your TPMs by 5000/1e6. Then you can just log-transform them and feed them through the same scater + scran pipeline.
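The rescaling arithmetic can be sketched in a few lines. This is plain Python for illustration only, not the scater/scran pipeline itself. The depth of 5000 is the example guess from above; the function names, the hypothetical TPM values, the log2 base and the pseudo-count of 1 are all assumptions made for the sketch.

```python
import math

def tpm_to_count_like(tpms, assumed_depth=5000):
    """Rescale one cell's TPMs (which sum to 1e6) to count-like values.

    assumed_depth is a guess at the per-cell sequencing depth, not a
    quantity recovered from the data.
    """
    scale = assumed_depth / 1e6
    return [t * scale for t in tpms]

def log_transform(values, pseudo_count=1.0):
    """log2(x + pseudo_count); base and pseudo-count are assumptions here."""
    return [math.log2(v + pseudo_count) for v in values]

# Hypothetical TPMs for a single cell (they sum to 1e6).
cell_tpms = [0.0, 200.0, 1000.0, 998800.0]

count_like = tpm_to_count_like(cell_tpms)   # values on a count-like scale
log_values = log_transform(count_like)      # ready for downstream steps
```

After rescaling, the values for the cell sum to the assumed depth rather than to one million, so a pseudo-count of 1 is back on a scale comparable to a single read.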

If you do it the other way (where you compute TPMs from the first dataset), you run into the same problem of choosing an appropriate offset for the log-transformation. This isn't entirely academic, because if your values are small compared to the pseudo-count, the log-transformation is basically a linear transformation. In scRNA-seq contexts, it tends to be the case that the TPMs are artificially large compared to the pseudo-count, which gives a lot of weight to the jump from zero to non-zero values (and thus increases the effect of noise due to dropouts, etc.).
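To make the pseudo-count argument concrete, here is a small Python sketch comparing log2(x + 1) at different scales. The numbers are hypothetical: a TPM of 200 is roughly what a single read would give in a cell sequenced to a depth of 5000 if gene length is ignored, and the helper name `log2p1` is made up for the example.

```python
import math

def log2p1(x):
    """log2 of x plus a pseudo-count of 1."""
    return math.log2(x + 1.0)

# Values much smaller than the pseudo-count: log2(x + 1) is close to
# x / ln(2), so the transformation is nearly linear.
small = [0.01, 0.02, 0.04]
small_logs = [log2p1(x) for x in small]
linear_approx = [x / math.log(2) for x in small]

# TPM scale: the step from "not detected" to "detected once" is far larger
# than the step from "detected once" to "detected twice".
tpm_zero_to_one = log2p1(200.0) - log2p1(0.0)
tpm_one_to_two = log2p1(400.0) - log2p1(200.0)

# Count-like scale (same hypothetical reads, rescaled to depth 5000):
# the two steps are of comparable size.
count_zero_to_one = log2p1(1.0) - log2p1(0.0)
count_one_to_two = log2p1(2.0) - log2p1(1.0)
```

On the TPM scale the zero-to-detected step is roughly 7.65 log2 units against just under 1 for a doubling, so whether a gene was detected at all dominates the differences between cells. On the count-like scale the same two steps are 1 and about 0.58.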