Scaling option in fastMNN
ATpoint ▴ 860
@atpoint-13662
Germany

In scater::calculatePCA there is an option to scale the logcounts prior to the actual PCA. I could not find a similar option in the context of fastMNN and multiBatchPCA. Does a similar option exist or would one need to scale counts externally, and is this even required or beneficial in the fastMNN context?

fastmnn scater batchelor • 220 views
Aaron Lun ★ 27k
@alun
The city by the bay

I have always been a bit dubious about standardization of log-expression matrices prior to PCA. We know that some genes are more informative than others, so why force them all to have the same variance? That suppresses the contribution of interesting genes that are highly variable while increasing the contribution of low-variance genes that are less interesting or driven by technical noise. (Further comments here.)

From a practical perspective, not scaling means that the downstream results are more robust to the exact choice of the number of HVGs you've selected. This is because the inclusion of lower-variance genes won't contribute much to the existing heterogeneity; in contrast, with scaling, each extra gene you decide to include now adds the same amount of variance as your top HVGs. Not scaling is also a bit logistically simpler as it saves us from a round of variance calculations and scaling. (Indeed, IIRC, older releases of irlba had bugs around the scaling functionality... and happily enough, this never affected me, because I never scaled in the first place.)
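To make the robustness argument concrete, here is a minimal numerical sketch in plain Python (not the scater or batchelor API). It assumes a made-up gene set of 10 high-variance genes followed by 90 low-variance genes, and the helper names (`simulate_genes`, `standardize`, `top_share`) are hypothetical. It reports how much of the total variance the top 10 genes carry as more genes are included, with and without standardization.

```python
import random
import statistics

def simulate_genes(n_cells, sds, seed=0):
    """Return one list of simulated log-expression values per gene, with the given SDs."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, sd) for _ in range(n_cells)] for sd in sds]

def standardize(values):
    """Center a gene and scale it to unit (population) variance."""
    mu = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mu) / sd for v in values]

def top_share(genes, n_top):
    """Fraction of the total variance carried by the first n_top genes."""
    variances = [statistics.pvariance(g) for g in genes]
    return sum(variances[:n_top]) / sum(variances)

# Assumed toy setup: 10 "top HVGs" (SD 2.0) followed by 90 low-variance genes (SD 0.2).
sds = [2.0] * 10 + [0.2] * 90
genes = simulate_genes(n_cells=500, sds=sds)
scaled = [standardize(g) for g in genes]

# Grow the selected gene set and track the share held by the top 10 genes.
for n_selected in (10, 50, 100):
    raw = top_share(genes[:n_selected], 10)
    std = top_share(scaled[:n_selected], 10)
    print(f"{n_selected:3d} genes: unscaled share = {raw:.3f}, scaled share = {std:.3f}")
```

Without scaling, the top genes should keep most of the total variance no matter how many low-variance genes are appended. With scaling, every gene has unit variance, so the share held by the top 10 genes is 10 divided by the number of genes selected.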

In the case of batch correction, it becomes even more complex because the naively computed per-gene variance will now include the batch effect. I don't even know what it means to scale by the variance in this scenario. I suppose you could instead scale by the estimates from combineVar to standardize on the average within-batch variance, but that's extra work for no clear benefit.
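The difference between the naive variance and the average within-batch variance can be shown with a small sketch, again in plain Python rather than the scran API. It assumes a single gene measured in two batches with the same spread and a hypothetical batch shift of +3. The per-batch averaging only mimics the idea of combining per-batch variance estimates; it is not a reimplementation of any library function.

```python
import random
import statistics

rng = random.Random(1)

# One gene in two batches: same within-batch spread (SD 1.0),
# but batch B is shifted by an assumed batch effect of +3.
batch_a = [rng.gauss(0.0, 1.0) for _ in range(300)]
batch_b = [rng.gauss(3.0, 1.0) for _ in range(300)]

# Naive variance: pool all cells and ignore the batch labels.
naive_var = statistics.variance(batch_a + batch_b)

# Within-batch variance: compute per batch, then average across batches.
within_var = statistics.fmean(
    [statistics.variance(batch_a), statistics.variance(batch_b)]
)

print(f"naive variance:        {naive_var:.2f}")
print(f"within-batch variance: {within_var:.2f}")
```

The pooled estimate absorbs the shift between batch means, so it comes out several times larger than the within-batch estimate even though the spread inside each batch is identical. Scaling by the pooled value would therefore shrink this gene according to the size of its batch effect rather than its biological variability.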