Question: Feature selection across batches in simpleSingleCell when spike-ins are not available.
0
4 weeks ago by
Angelos Armen0 wrote:

In the simpleSingleCell Correcting batch effects vignette spike-ins are available. What is the recommended workflow in the absence of spike-ins? I understand that makeTechTrend should be applied to each batch separately and then genes with positive biological variance in any batch should be selected before calling multiBatchNorm. makeTechTrend should not be applied after multiBatchNorm because the latter scales the counts and a scaled Poisson distribution is not a Poisson distribution (as assumed by makeTechTrend for the counts of constant genes at a given size factor).

simplesinglecell • 121 views
modified 4 weeks ago by Aaron Lun24k • written 4 weeks ago by Angelos Armen0
Answer: Feature selection across batches in simpleSingleCell when spike-ins are not avai
2
4 weeks ago by
Aaron Lun24k
Cambridge, United Kingdom
Aaron Lun24k wrote:

In the simpleSingleCell Correcting batch effects vignette spike-ins are available.

I wouldn't read too much into that. It just happened that spike-ins were available so I used them for modelling the mean-variance trend. If you don't have spike-ins, then just swap it for one of the other modelling approaches.

I understand that makeTechTrend should be applied to each batch separately and then genes with positive biological variance in any batch should be selected before calling multiBatchNorm.

• multiBatchNorm() is called on the universe of all common genes, not just the HVGs. This is actually important to reduce the risk of violating the non-DE assumption between batches.
• As I mentioned in my answer to your other question, we don't look for genes with positive biological components in any batch.

makeTechTrend should not be applied after multiBatchNorm because the latter scales the counts and a scaled Poisson distribution is not a Poisson distribution (as assumed by makeTechTrend for the counts of constant genes at a given size factor).

makeTechTrend() only assumes that the underlying counts are Poisson. It never assumes that the scaled counts are Poisson. Even if you apply makeTechTrend() in a single batch scenario, there would be scaling involved because of the differences in size factors between cells, so the function is already handling it.

But to answer your actual question, no, I would not bother applying makeTechTrend() after multiBatchNorm(). Instead, I would apply makeTechTrend() in each batch to compute the biological component per batch, which seems to be what you were planning to do. However...

makeTechTrend() provides little benefit here, because its primary role is to estimate the technical component for denoisePCA()... but I don't use denoisePCA() in multi-sample studies, simply because it is even harder to choose the number of PCs for multiple samples when there's batch vectors thrown into the mix. So, that means that makeTechTrend()'s utility is limited to feature selection, and if you're working with any kind of 10X data, you'll notice that almost every gene has a positive biological component anyway when you use makeTechTrend(). (Which is sensible, as every gene should exhibit some overdispersion.)

This all means that if you take all genes with positive average components, you just end up keeping the vast majority of genes for the fastMNN() step. That's not necessarily wrong, but you could have gotten to the same point by skipping all of the variance modelling steps. If you wanted to make more use of the trend, you could exploit the magnitude of the biological component, e.g., by taking the top X genes with the largest average components; or you could fit the trend directly to the endogenous genes with use.spikes=FALSE. All of these approaches are valid, pending different assumptions about what is interesting in the data.

multiBatchNorm() is called on the universe of all common genes, not just the HVGs. This is actually important to reduce the risk of violating the non-DE assumption between batches.

Sorry I meant to write

genes with positive biological variance in any batch should be selected before calling fastMNN

instead of multiBatchNorm. I'm calling multiBatchNorm on all genes. In the vignette, however, multiBatchNorm is used with HVGs.

makeTechTrend() only assumes that the underlying counts are Poisson. It never assumes that the scaled counts are Poisson. Even if you apply makeTechTrend() in a single batch scenario, there would be scaling involved because of the differences in size factors between cells, so the function is already handling it.

It turns out that the reason for my confusion was multiBatchNorm, I thought it did something else than what it actually does. I had previously examined makeTechTrend and now I understand that it's perfectly valid to call it after multiBatchNorm.

makeTechTrend() provides little benefit here, because its primary role is to estimate the technical component for denoisePCA()... but I don't use denoisePCA() in multi-sample studies, simply because it is even harder to choose the number of PCs for multiple samples when there's batch vectors thrown into the mix

Wouldn't it be valid to use denoisePCANumber after multiBatchPCA with var.tech and var.total computed by taking grand means instead of means? (That would require implementing a makeMultiBatchTechTrend function to compute var.tech, though.)

1

In the vignette, however, multiBatchNorm is used with HVGs.

Is it? Looks like it's being called with universe.

Wouldn't it be valid to use denoisePCANumber after multiBatchPCA with var.tech and var.total computed by taking grand means instead of means?

I thought that too, see attempts here. But some anecdotal testing suggests it doesn't work that well - heterogeneity within each batch seems to be lost if you pick too few PCs for this step. I would guess that the large batch effects are pushing biological signal into later PCs where they get discarded if you don't keep enough of them (and denoisePCA is known to err on the side of picking too few, given that it defines the lower bound). There are also some more subtle problems with fastMNN() if you have too few PCs, as the the batch effect may no longer be orthogonal to the biological subspace in low dimensions - see the reference in ?fastMNN.

Is it? Looks like it's being called with universe.

In the merging vignette of compareSingleCell, var.tech is computed by combineVar within multiBlockVar. combineVar, however, returns mean variances across batches. Shouldn't pooled variances be returned instead?

No, because you don't want to capture the variance due to the batch effect.

Recall that denoisePCA() operates by assuming that the technical noise is random and occupies the later PCs that are ultimately discarded. A batch effect is not random and will likely occupy one of the very early (often the first or second) PCs. If you include the variance due to the batch effect in your estimate of the technical noise, you will increase the apparent technical variance and thus the number of PCs you discard. This will likely chew into the PCs representing biology without actually removing the PCs driven by the batch effect.

The PCA has little to do with batch correction itself. It is simply a way to compress the data and remove some random noise. From this perspective, it is important to preserve the batch-to-batch differences so that the actual correction (i.e., the MNN step) can work properly.