I am following the Bioconductor simpleSingleCell workflow for droplet-based data and have a question regarding pre-processing. I have 10x scRNA-seq data from multiple samples. These were prepared in different wells on the same Chromium chip and run in the same lane of a single flowcell on a HiSeq 4000. I ultimately want to perform comparisons between the samples, but I'm not sure at what stage of pre-processing the samples should be combined. In particular, the RNA content and activity of cells may differ markedly between samples, so I assume the empty droplet detection step should be performed independently for each sample? Given that cells from different samples are physically separated on the Chromium chip, I also assume doublet detection should be performed independently?
My proposed workflow would be the following:
1. Remove barcode swapping (All samples)
2. Remove empty droplets (Per sample)
3. Calculate QC metrics (Per sample)
4. Remove low quality cells (Per sample)
5. Assign cell cycle phases (Per sample)
6. Remove zero count genes (Per sample, may cause problems later)
7. Normalization for cell-specific biases (Per sample)
8. Modelling the mean-variance trend (Per sample)
9. Dimensionality reduction (Per sample)
10. Clustering (Per sample)
11. Remove doublets detected by clusters / by simulation (Per sample)
12. Combine raw count matrices from all remaining cells across samples
13. Go back to the normalization step (7) and process all samples together
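
To make the "filter per sample, then combine" structure concrete, here is a minimal toy sketch in plain Python. It is not the Bioconductor API: the data, the function names (`filter_cells`, `combine`, `drop_zero_genes`) and the total-count threshold are all hypothetical stand-ins for the real per-sample QC steps. It also illustrates why removing zero-count genes per sample may cause problems later: the samples can end up with different gene sets.

```python
from typing import Dict, List, Tuple

# Toy representation (an assumption for illustration): gene -> counts, one entry per cell.
Counts = Dict[str, List[int]]


def n_cells(counts: Counts) -> int:
    """Number of cells in a toy count table."""
    return len(next(iter(counts.values())))


def filter_cells(counts: Counts, min_total: int) -> Counts:
    """Keep cells whose total count is at least min_total (stand-in for per-sample QC)."""
    totals = [sum(v[i] for v in counts.values()) for i in range(n_cells(counts))]
    keep = [i for i, t in enumerate(totals) if t >= min_total]
    return {g: [v[i] for i in keep] for g, v in counts.items()}


def combine(samples: Dict[str, Counts]) -> Tuple[Counts, List[str]]:
    """Concatenate cells across samples over the shared genes, with a sample label per cell."""
    genes = sorted(set.intersection(*(set(c) for c in samples.values())))
    combined: Counts = {g: [] for g in genes}
    labels: List[str] = []
    for name, counts in samples.items():
        for g in genes:
            combined[g].extend(counts[g])
        labels.extend([name] * n_cells(counts))
    return combined, labels


def drop_zero_genes(counts: Counts) -> Counts:
    """Remove genes with zero counts in every cell."""
    return {g: v for g, v in counts.items() if any(v)}


# Hypothetical raw counts for two samples.
raw = {
    "A": {"GeneW": [0, 0, 0], "GeneX": [5, 0, 3], "GeneY": [0, 0, 0], "GeneZ": [2, 1, 4]},
    "B": {"GeneW": [0, 0], "GeneX": [1, 6], "GeneY": [4, 0], "GeneZ": [0, 0]},
}

# Cell-level filtering is done independently per sample.
filtered = {name: filter_cells(counts, min_total=2) for name, counts in raw.items()}

# Gene filtering is deferred until after the samples are combined.
combined, labels = combine(filtered)
combined = drop_zero_genes(combined)

# For comparison: genes that would survive if zero-count genes were dropped per sample.
per_sample_genes = set.intersection(*(set(drop_zero_genes(c)) for c in filtered.values()))

print(sorted(combined), labels, sorted(per_sample_genes))
```

In this toy example, dropping zero-count genes only after combining retains every gene expressed in at least one sample, whereas dropping them per sample and then taking the shared genes retains fewer.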
Does this seem reasonable, or am I over-complicating the pre-processing steps?