In the simpleSingleCell "Correcting batch effects" vignette, spike-ins are available.
I wouldn't read too much into that. It just happened that spike-ins were available, so I used them for modelling the mean-variance trend. If you don't have spike-ins, then just swap in one of the other modelling approaches.
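As a hedged sketch of what "one of the other modelling approaches" might look like, assuming sce is a normalized SingleCellExperiment and the scran API from the era of that vignette (trendVar()/decomposeVar(); newer releases use modelGeneVar() and friends instead):

```r
library(scran)

# Option 1: fit the mean-variance trend to the endogenous genes instead of
# the spike-ins, assuming most genes are dominated by technical noise.
fit <- trendVar(sce, use.spikes=FALSE)

# Option 2: construct a purely technical trend from Poisson noise,
# appropriate for UMI data (e.g., 10X) where spike-ins are absent.
fit$trend <- makeTechTrend(x=sce)

# Decompose the per-gene variance into technical and biological components.
dec <- decomposeVar(sce, fit)
```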
I understand that makeTechTrend should be applied to each batch separately, and then genes with positive biological variance in any batch should be selected before calling multiBatchNorm.

multiBatchNorm() is called on the universe of all common genes, not just the HVGs. This is actually important to reduce the risk of violating the non-DE assumption between batches.
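For concreteness, a sketch of that step (sce1 and sce2 are hypothetical per-batch SingleCellExperiments; multiBatchNorm() lived in scran at the time and later moved to batchelor):

```r
library(batchelor)

# The 'universe' is the set of genes shared by all batches - not the HVGs.
universe <- intersect(rownames(sce1), rownames(sce2))

# Rescale size factors so coverage is comparable across batches, using all
# common genes to reduce the risk of violating the non-DE assumption.
rescaled <- multiBatchNorm(sce1[universe,], sce2[universe,])
sce1 <- rescaled[[1]]
sce2 <- rescaled[[2]]
```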
- As I mentioned in my answer to your other question, we don't look for genes with positive biological components in any batch.
makeTechTrend should not be applied after multiBatchNorm because the latter scales the counts, and a scaled Poisson distribution is not a Poisson distribution (as assumed by makeTechTrend for the counts of constant genes at a given size factor).
makeTechTrend() only assumes that the underlying counts are Poisson; it never assumes that the scaled counts are Poisson. Even if you apply makeTechTrend() in a single-batch scenario, there would be scaling involved because of the differences in size factors between cells, so the function is already handling it.

But to answer your actual question: no, I would not bother applying makeTechTrend() after multiBatchNorm(). Instead, I would apply makeTechTrend() in each batch to compute the biological component per batch, which seems to be what you were planning to do. However...
makeTechTrend() provides little benefit here, because its primary role is to estimate the technical component for denoisePCA()... but I don't use denoisePCA() in multi-sample studies, simply because it is even harder to choose the number of PCs for multiple samples when there are batch vectors thrown into the mix. So, that means that makeTechTrend()'s utility is limited to feature selection, and if you're working with any kind of 10X data, you'll notice that almost every gene has a positive biological component anyway when you use makeTechTrend(). (Which is sensible, as every gene should exhibit some overdispersion.)

This all means that if you take all genes with positive average components, you just end up keeping the vast majority of genes for the fastMNN() step. That's not necessarily wrong, but you could have gotten to the same point by skipping all of the variance modelling steps. If you wanted to make more use of the trend, you could exploit the magnitude of the biological component, e.g., by taking the top X genes with the largest average components; or you could fit the trend directly to the endogenous genes with use.spikes=FALSE. All of these approaches are valid, pending different assumptions about what is interesting in the data.
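A hedged sketch of the top-X approach across two batches (sce1/sce2 and the choice of 1000 genes are placeholders; the trendVar()/decomposeVar()/combineVar() calls assume the scran API of that era, and fastMNN() later moved to batchelor):

```r
library(scran)

# Per-batch variance decomposition with a Poisson-derived technical trend.
fit1 <- trendVar(sce1, use.spikes=FALSE)
fit1$trend <- makeTechTrend(x=sce1)
dec1 <- decomposeVar(sce1, fit1)

fit2 <- trendVar(sce2, use.spikes=FALSE)
fit2$trend <- makeTechTrend(x=sce2)
dec2 <- decomposeVar(sce2, fit2)

# Average the per-batch statistics and rank by the biological component,
# keeping the top X genes rather than everything with a positive component.
combined <- combineVar(dec1, dec2)
top.hvgs <- rownames(combined)[order(combined$bio, decreasing=TRUE)][1:1000]

# Use only the chosen genes for the MNN correction step.
mnn.out <- fastMNN(sce1, sce2, subset.row=top.hvgs)
```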
Sorry, I meant to write … instead of multiBatchNorm. I'm calling multiBatchNorm on all genes. In the vignette, however, multiBatchNorm is used with HVGs. It turns out that the reason for my confusion was multiBatchNorm; I thought it did something other than what it actually does. I had previously examined makeTechTrend, and now I understand that it's perfectly valid to call it after multiBatchNorm.

Wouldn't it be valid to use denoisePCANumber after multiBatchPCA, with var.tech and var.total computed by taking grand means instead of means? (That would require implementing a makeMultiBatchTechTrend function to compute var.tech, though.)

Is it? Looks like it's being called with universe.

I thought that too, see attempts here. But some anecdotal testing suggests it doesn't work that well - heterogeneity within each batch seems to be lost if you pick too few PCs for this step. I would guess that the large batch effects are pushing biological signal into later PCs, where it gets discarded if you don't keep enough of them (and denoisePCA is known to err on the side of picking too few, given that it defines the lower bound). There are also some more subtle problems with fastMNN() if you have too few PCs, as the batch effect may no longer be orthogonal to the biological subspace in low dimensions - see the reference in ?fastMNN.

Sorry, my bad, you're right.
In the merging vignette of compareSingleCell, var.tech is computed by combineVar within multiBlockVar. combineVar, however, returns mean variances across batches. Shouldn't pooled variances be returned instead?

No, because you don't want to capture the variance due to the batch effect.
Recall that denoisePCA() operates by assuming that the technical noise is random and occupies the later PCs that are ultimately discarded. A batch effect is not random and will likely occupy one of the very early (often the first or second) PCs. If you include the variance due to the batch effect in your estimate of the technical noise, you will increase the apparent technical variance and thus the number of PCs you discard. This will likely chew into the PCs representing biology without actually removing the PCs driven by the batch effect.

The PCA has little to do with batch correction itself. It is simply a way to compress the data and remove some random noise. From this perspective, it is important to preserve the batch-to-batch differences so that the actual correction (i.e., the MNN step) can work properly.
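To make the distinction concrete, here is a toy base-R illustration with entirely made-up numbers: averaging per-batch variances (roughly what combineVar() reports) reflects only the within-batch variability, whereas pooling all cells into a single variance estimate also captures the batch effect.

```r
# One gene, two batches: identical within-batch variance, but a shifted
# mean in batch 2 (i.e., a batch effect).
set.seed(42)
batch1 <- rnorm(100, mean=0, sd=1)
batch2 <- rnorm(100, mean=5, sd=1)

# Mean of the per-batch variances: close to the true within-batch value of 1.
mean.var <- mean(c(var(batch1), var(batch2)))

# Variance across all cells pooled together: inflated by the mean shift
# between batches, i.e., it captures the batch effect.
global.var <- var(c(batch1, batch2))
```

Using the pooled estimate as var.tech would thus overstate the technical noise and discard too many PCs, as described above.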