For starters, you'll need to log-transform the values. The problem lies in the fact that the usual pseudo-count of 1 makes no sense for TPMs. (One could argue that it doesn't make sense in general, but it is especially nonsensical in this case, where you're adding a "count" to a non-count TPM.) I discuss this in more detail here.

Without the counts, the next-best solution is to guess the average per-cell sequencing depth for the second experiment. For example, if you assume that each cell was sequenced to a depth of 5000, then you could recover some normalized count-like values by multiplying your TPMs by `5000/1e6`. Then you can just log-transform them and feed them through the same *scater* + *scran* pipeline.
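To make the rescaling concrete, here is a minimal sketch in plain Python (not the *scater*/*scran* code itself). The TPM values are made up, the depth of 5000 is the same guess as above, and the function names `tpm_to_pseudo_counts` and `log_transform` are hypothetical helpers for illustration only.

```python
import math

# Assumption: a guessed average per-cell sequencing depth, as in the text.
ASSUMED_DEPTH = 5000


def tpm_to_pseudo_counts(tpms, assumed_depth=ASSUMED_DEPTH):
    """Rescale TPMs (which sum to 1e6) to count-like values at the assumed depth."""
    return [t * assumed_depth / 1e6 for t in tpms]


def log_transform(values, pseudo_count=1.0):
    """log2(x + pseudo_count); a pseudo-count of 1 is sensible once values are count-like."""
    return [math.log2(v + pseudo_count) for v in values]


# Made-up TPMs for a handful of genes in one cell.
tpms = [0.0, 200.0, 1000.0, 600.0]

pseudo_counts = tpm_to_pseudo_counts(tpms)   # [0.0, 1.0, 5.0, 3.0]
log_values = log_transform(pseudo_counts)
```

With a depth of 5000, a TPM of 200 maps back to a single count-like unit, so adding a pseudo-count of 1 is on the same scale as the data again.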

If you do it the other way (where you compute TPMs from the first dataset), you run into the same problem of choosing an appropriate offset for the log-transformation. This isn't entirely academic: if your values are small compared to the pseudo-count, the log-transformation is basically a linear transformation. In scRNA-seq contexts, the TPMs tend to be artificially large compared to the pseudo-count, which gives a lot of weight to the jump from zero to non-zero values (and thus increases the effect of noise due to dropouts, etc.).
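A small numeric sketch of both regimes, again in plain Python. It assumes a pseudo-count of 1 and the same guessed depth of 5000, and it ignores gene-length normalization, so a single read corresponds to a TPM of `1e6/5000 = 200`. All numbers are illustrative.

```python
import math


def log2p1(x):
    """log2 with a pseudo-count of 1."""
    return math.log2(x + 1.0)


# Regime 1: values much smaller than the pseudo-count.
# log2(1 + x) is close to x / ln(2), so the transformation is nearly linear.
small = [0.001, 0.002, 0.004]
ratios = [log2p1(x) / x for x in small]   # all close to 1 / ln(2)

# Regime 2: TPMs much larger than the pseudo-count.
# Simplifying assumption: one read in a cell sequenced to a depth of 5000
# corresponds to a TPM of 200 (gene length ignored).
tpm_one_read = 1e6 / 5000

# Going from zero reads to one read...
jump_zero_to_one = log2p1(tpm_one_read) - log2p1(0.0)
# ...versus going from one read to two reads.
jump_one_to_two = log2p1(2 * tpm_one_read) - log2p1(tpm_one_read)
```

Under these assumptions, the step from zero to a single read spans more than seven log2 units, while doubling the expression from one read to two adds just under one unit. That is the sense in which dropouts dominate the log-TPM scale.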