Question

Recommendations for combining multiple 10x runs into one SingleCellExperiment

Peter Hickey ▴ 760

@petehaitch

WEHI, Melbourne, Australia

I've got an experiment with eight 10x scRNA-seq runs that I'm analysing by starting with something based on the simpleSingleCell workflow. I constructed each SingleCellExperiment with DropletUtils::read10xCounts().

I'm looking for any opinions or advice on when to combine these into one SingleCellExperiment object vs. having, say, a list of SingleCellExperiment objects (one per run)? I've been tossing up between a few options:

Pass all runs via the samples argument of DropletUtils::read10xCounts() and add run to the colData.
Filter each run (to remove empty drops) independently and then combine.
Keep them separate until running something like scran::mnnCorrect() (which would seem to need a separate expression matrix for each run, anyway).

Thanks, Pete

singlecellexperiment dropletutils simplesinglecell • 3.6k views

Steve Lianoglou · Answer 1 · 2018-10-15

I'd process cells in each run separately up until the point that they need to be combined. This is actually necessary for some procedures: emptyDrops, because the ambient pool will probably differ between runs, and doubletCells, because doublets can't form between runs. Processing them separately will also make things clearer with respect to quality control of individual samples, and just generally give you a more precise idea of what is present in each sample before you try to mush everything together into a single data set.
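As a rough illustration of the per-run approach, here is a minimal sketch that filters each run with its own emptyDrops call and only then combines the results. The helper names (process_run, combine_runs), the run_dirs vector and the 0.001 FDR cutoff are assumptions made for this example rather than anything prescribed in the thread, and it assumes all runs were aligned to the same reference so that the rows match.

```r
# Hypothetical helper: read one 10x run and keep only the barcodes that
# emptyDrops calls as cells. Argument names and the cutoff are illustrative.
process_run <- function(run_dir, run_name, fdr_cutoff = 0.001) {
  sce <- DropletUtils::read10xCounts(run_dir, sample.names = run_name)
  # The ambient profile is estimated from this run's own low-count barcodes.
  e.out <- DropletUtils::emptyDrops(SingleCellExperiment::counts(sce))
  # which() drops barcodes whose FDR is NA (totals at or below `lower`).
  keep <- which(e.out$FDR <= fdr_cutoff)
  sce[, keep]
}

# Hypothetical helper: process every run independently, then combine.
# `run_dirs` is assumed to be a named character vector of 10x output paths.
combine_runs <- function(run_dirs) {
  sce_list <- Map(process_run, run_dirs, names(run_dirs))
  # cbind() on SingleCellExperiment objects requires identical rows.
  do.call(cbind, unname(sce_list))
}

# Example call (paths are placeholders):
# sce <- combine_runs(c(run1 = "path/to/run1", run2 = "path/to/run2"))
```

The run of origin stays available in the Sample column of the colData, which read10xCounts() fills in from sample.names. emptyDrops uses Monte Carlo p-values, so calling set.seed() beforehand is advisable if reproducible calls are needed.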

The only downside of processing them separately is that you cannot detect genes that are highly variable across samples. If some samples are from different conditions, the standard variance modelling within each sample will not pick up genes that are only DE between conditions. How much of a problem this is depends on your downstream applications. If you're going to batch correct across all samples anyway, then it doesn't matter, as any DE genes would end up being wiped out by the batch correction.

You can avoid this with careful experimental design, e.g., paired WT/KO samples in each batch so that correction cannot remove genotype differences. You can also detect DE genes between conditions by summing the counts of cells within each batch (possibly per population) and treating the sums as pseudo-bulk samples for edgeR analyses (see https://doi.org/10.1093/biostatistics/kxw055). This complements a batch-corrected single-cell-level analysis, e.g., when a treatment causes both systematic DE and changes in population composition.
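The pseudo-bulk summation can be sketched in base R as below. The toy count matrix, the batch and cluster labels and all variable names are made up for illustration; with real data the counts would come from the SingleCellExperiment and the labels from its colData.

```r
# Toy counts: 3 genes x 6 cells (made-up numbers, filled column by column).
counts <- matrix(
  c(1, 0, 2,
    0, 3, 1,
    4, 1, 0,
    2, 2, 2,
    0, 0, 5,
    1, 1, 1),
  nrow = 3,
  dimnames = list(paste0("gene", 1:3), paste0("cell", 1:6))
)

# Illustrative per-cell labels: batch of origin and assigned population.
batch <- c("b1", "b1", "b1", "b2", "b2", "b2")
cluster <- c("A", "B", "A", "A", "B", "B")

# One pseudo-bulk sample per batch/population combination.
group <- paste(batch, cluster, sep = "_")

# rowsum() sums rows by group, so transpose to sum over cells (columns).
pseudobulk <- t(rowsum(t(counts), group))
```

Each column of pseudobulk is then one batch/population combination, and the matrix could be passed to edgeR::DGEList(counts = pseudobulk) with batch and condition included in the design, if that suits the experiment.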

Also, make sure you read the devel version of the batch correction workflow, which is quite a bit more performant than mnnCorrect.