I'd process cells in each run separately up until the point that they need to be combined. This is actually necessary for some procedures:

- `emptyDrops`, as the ambient pool will probably differ between runs; and
- `doubletCells`, as doublets can't form between runs.

Processing them separately will also make things clearer with respect to quality control of individual samples, and generally give you a more precise idea of what is present in each sample before you try to mush everything together into a single data set.
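A per-run pipeline might look something like this. This is only a minimal sketch using `DropletUtils`, `scran` and `scater`; the paths and the FDR threshold are placeholders, and in recent Bioconductor releases the doublet scoring has moved out of `scran` into other packages:

```r
library(DropletUtils)
library(scran)
library(scater)

# Hypothetical paths to each run's raw CellRanger output.
run.paths <- c("run1/raw_gene_bc_matrices", "run2/raw_gene_bc_matrices")

per.run <- lapply(run.paths, function(path) {
    sce <- read10xCounts(path)

    # Call cells against this run's own ambient pool.
    e.out <- emptyDrops(counts(sce))
    sce <- sce[, which(e.out$FDR <= 0.001)]

    # Normalize and score doublets within this run only,
    # as doublets cannot form between runs.
    sce <- logNormCounts(sce)
    sce$doublet.score <- doubletCells(sce)
    sce
})
```

Each element of `per.run` can then be quality-controlled on its own terms before any merging happens.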
The only downside of processing them separately is that you cannot detect genes that are highly variable across samples. If some samples are from different conditions, the standard variance modelling within each sample will not pick up the genes that are only DE between conditions. How much of this is a problem depends on your downstream applications. If you're going to batch correct across all samples anyway, then it doesn't matter as any DE genes would end up being wiped out by the batch correction.
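To illustrate the point about variance modelling: combining per-sample variance estimates only captures within-sample variability, so a gene that is flat within every sample but DE between conditions will not rank as highly variable. A sketch, assuming `per.run` is a list of log-normalized `SingleCellExperiment` objects (one per run, same genes):

```r
library(scran)

# Model the mean-variance trend within each sample separately.
dec.list <- lapply(per.run, modelGeneVar)

# combineVar() averages the per-sample statistics, so genes that only
# differ between conditions contribute nothing to the biological
# component and will not be selected here.
dec <- do.call(combineVar, dec.list)
hvgs <- rownames(dec)[order(dec$bio, decreasing=TRUE)][1:2000]
```

Whether this loss matters depends, as above, on whether those between-condition differences would survive batch correction anyway.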
You can avoid this with careful experimental design, e.g., paired WT/KO samples in each batch so that correction cannot remove genotype differences. You can also detect DE genes between conditions by summing cells within each batch (possibly per population) and treating them as pseudo-bulk for edgeR analyses (see https://doi.org/10.1093/biostatistics/kxw055). This complements a batch-corrected single-cell-level analysis, e.g., when a treatment causes both a systematic DE and changes in population composition.
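As a sketch of the pseudo-bulk approach, assuming a combined `SingleCellExperiment` named `sce` with (placeholder) `colData` fields `sample`, `cluster` and `condition`:

```r
library(scater)
library(edgeR)

# Sum counts across cells for each sample, per cluster, to get
# pseudo-bulk profiles.
summed <- aggregateAcrossCells(sce,
    ids=DataFrame(sample=sce$sample, cluster=sce$cluster))

# Standard edgeR quasi-likelihood pipeline on the summed counts.
y <- DGEList(counts(summed), samples=colData(summed))
y <- calcNormFactors(y)
design <- model.matrix(~condition, y$samples)
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design, robust=TRUE)
res <- glmQLFTest(fit, coef=2)
topTags(res)
```

The summing step means the dispersion estimates reflect replication across samples rather than across cells, which is what edgeR's framework expects.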
Also, make sure you read the devel version of the batch correction workflow, which is quite a bit more performant than the release version.
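For reference, the MNN-based correction used in that workflow boils down to something like the following sketch, assuming `batchelor` and a list `per.run` of log-normalized `SingleCellExperiment` objects subsetted to a common set of HVGs:

```r
library(batchelor)

# Merge runs with mutual-nearest-neighbours correction.
corrected <- do.call(fastMNN, per.run)

# The corrected low-dimensional coordinates are what you cluster and
# visualize on; per-gene expression values are left uncorrected for
# any downstream DE testing.
reducedDim(corrected, "corrected")
```

Keeping the corrected values out of the DE analysis is deliberate, for exactly the reason above: correction will wipe out genuine between-condition differences along with the batch effect.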