Question

[MnnCorrect] Remove batch effects but keep condition effects

John Reid ▴ 10

@john-reid-18530

Last seen 5.7 years ago

University of Cambridge

Dear all, Laleh & Aaron,

I'm hoping someone can point to the best way to handle the data that I have.

The experimental design is as follows. Our study has individuals from three distinct conditions, say control, condition1 and condition2. Cells from each individual are sorted into two related types, say A and B. These samples are sequenced using 10X in the same batch/run. For practical reasons we cannot assay cells from different individuals in the same batch.

Cell type A is of greater biological interest. In particular we are interested in transcriptional changes in cells of type A between the control individuals and either condition.

The number of cells in each cell-type sample varies from ~500 to ~10,000. The median reads per cell varies from ~30 up to ~3,000. Cell type B is more temperamental than cell type A, resulting in lower-quality data. This is a pilot study, so currently we only have data from 7 individuals (~15,000 cells in total).

This seems to fit the rationale for MnnCorrect fairly well, in that we wish to correct for batch effects when the experimental design confounds batches with individuals. However, if I run MnnCorrect on all the batches together, I will likely remove the effect that we are looking for. Is my best bet to run MnnCorrect separately on the individuals belonging to each condition? What is the best way to make use of the cell sorting to help with the batch correction?

I don't have experience with data of such low coverage. Most/all of the single cell methods seem to be designed for data with much higher coverage. MnnCorrect doesn't seem to assume high depth data.

As the median reads per cell vary so much across samples, I do not wish to apply standard QC metrics to the data as a whole. Would you recommend applying standard QC procedures to each batch separately? At the moment I am just doing very simple QC on the data as a whole: removing cells with a high percentage of mitochondrial reads and filtering the genes down to the 10% most variable. I'm not using any thresholds for number of genes detected or total read counts.

Thanks for taking the time to read this and for a very nice publication on MnnCorrect. Any thoughts would be very welcome.

John.

single-cell scran mnncorrect rnaseq • 1.9k views

updated 5.7 years ago by Aaron Lun ★ 28k • written 5.7 years ago by John Reid ▴ 10

score 1 · Answer 1 · 2018-11-28

I would analyze your experiment in the following way, assuming you want to map all of your cells to a common coordinate system (you could also do this separately for each cell type, if so inclined):

1. Treat all cells from a single individual as belonging to the same batch.
2. Perform MNN correction across individuals, regardless of condition.
3. Cluster cells in the corrected space to define the cell subtypes.
4. Perform DE analyses between conditions within each cluster, using the uncorrected counts.
5. Perform differential abundance analyses between conditions for each cluster.

A difference between your conditions will manifest in one of two ways. The first is that it creates a completely different cell type, which will be detected as a change in abundance in step 5. The second is that it changes the expression of an existing cell type but not in a manner that creates a separate cluster, such that the correction will force those cells into the same common cluster. Fortunately, this will still be detected by step 4 using the uncorrected counts.
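To make the abundance side of this concrete, here is a minimal sketch of the table that a differential abundance analysis would start from: the number of cells each individual contributes to each cluster. The function name, labels and toy data are made up for illustration, and the statistical test on these counts is not shown.

```python
from collections import Counter

def cluster_abundance(clusters, individuals):
    """Count cells per (individual, cluster) and convert the counts
    to within-individual proportions."""
    if len(clusters) != len(individuals):
        raise ValueError("need one cluster label and one individual label per cell")
    counts = Counter(zip(individuals, clusters))
    totals = Counter(individuals)
    proportions = {key: n / totals[key[0]] for key, n in counts.items()}
    return counts, proportions

# Hypothetical per-cell labels (cluster assignments come from the corrected space).
clusters = ["c1", "c1", "c2", "c1", "c2", "c2"]
individuals = ["ind1", "ind1", "ind1", "ind2", "ind2", "ind2"]

cell_counts, cell_props = cluster_abundance(clusters, individuals)
for (individual, cluster), n in sorted(cell_counts.items()):
    print(individual, cluster, n, round(cell_props[(individual, cluster)], 3))
```

A cluster that appears only under one condition would show up here as counts near zero for every individual from the other conditions.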

I'll refer you to the cydar package and diffcyt workflow for some indications of how to do step 5. (These are written for mass cytometry, but the general concepts are the same.) DE analyses are straightforward and are most easily done using pseudo-bulk counts (see Section 5.2 of this workflow) from all cells in a cluster for each individual. It's important to use the uncorrected counts to do this, as the correction will have eliminated any differences between individuals, so by definition you wouldn't get any differences in corrected values.
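As a rough illustration of what "pseudo-bulk counts" means here, the sketch below sums the uncorrected counts of all cells that share a cluster and an individual. It is plain Python on a toy matrix, not the code from the workflow; the function name and the data are assumptions for illustration only.

```python
def pseudo_bulk(counts, clusters, individuals):
    """Sum uncorrected per-cell counts into one profile per (cluster, individual).

    counts: one list of integer counts per cell, all in the same gene order.
    clusters, individuals: per-cell labels, same length as counts.
    """
    if not (len(counts) == len(clusters) == len(individuals)):
        raise ValueError("counts, clusters and individuals need one entry per cell")
    sums = {}
    for cell, cluster, individual in zip(counts, clusters, individuals):
        key = (cluster, individual)
        if key not in sums:
            sums[key] = [0] * len(cell)
        profile = sums[key]
        for gene_index, value in enumerate(cell):
            profile[gene_index] += value
    return sums

# Toy data: 4 cells x 3 genes (hypothetical values).
counts = [[1, 0, 3], [2, 1, 0], [0, 0, 5], [4, 1, 1]]
clusters = ["c1", "c1", "c1", "c2"]
individuals = ["ind1", "ind1", "ind2", "ind2"]

bulk = pseudo_bulk(counts, clusters, individuals)
for key in sorted(bulk):
    print(key, bulk[key])
```

Each (cluster, individual) profile then plays the role of one bulk sample, so within a cluster the individuals act as replicates for the comparison between conditions.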

Note that step 1 is necessary if you're processing cell types A and B together, as MNN requires a shared manifold (i.e., some common cell type) across the batches you are trying to correct. This would not be possible if you tried to correct within individuals, i.e., across A and B. In general, I haven't seen strong lane-to-lane effects from cells in the same run, so treating each individual as a batch should be okay. I would probably use a hierarchical approach for the correction, whereby all individuals of the same condition are corrected first, followed by correction between conditions. An example of how to do this is provided in the MNN workflow.
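The hierarchical ordering can be written down before any correction is run. The sketch below only builds the plan (which individuals are merged together first); it does not perform MNN correction, and the individual IDs and condition assignments are hypothetical.

```python
def within_condition_groups(condition_of):
    """Group individuals (batches) by condition.

    Returns a list of (condition, [individuals]) pairs. Individuals inside a
    pair would be corrected against each other first; the merged results
    would then be corrected across the pairs.
    """
    groups = {}
    for individual, condition in condition_of.items():
        groups.setdefault(condition, []).append(individual)
    return [(condition, sorted(members)) for condition, members in sorted(groups.items())]

# Hypothetical design: one batch per individual.
condition_of = {
    "ind1": "control",
    "ind2": "control",
    "ind3": "condition1",
    "ind4": "condition1",
    "ind5": "condition2",
}

plan = within_condition_groups(condition_of)
for condition, members in plan:
    print("first merge within", condition, ":", members)
print("then merge across:", [condition for condition, _ in plan])
```

The MNN workflow remains the reference for how to carry out the two rounds of correction themselves.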

Of course, all of this is assuming that cell types A and B are somewhat heterogeneous. If these are homogeneous populations, you might as well just do a DE analysis directly on the pseudo-bulk counts obtained by adding all cells of the same type together in each individual. In fact, if you were confident of obtaining homogeneous cell types and all you cared about was DE between conditions, you might as well have done bulk RNA-seq, which probably would have been much cheaper.

Applying QC and feature detection on each sample is fine; this is what we suggest in the MNN workflow anyway. Large data sets tend to be too heterogeneous for a standardized quality control regime. Technically, this makes it difficult to compare across samples, as sample-specific QC could introduce biases. However, it's something we'll just have to put up with for the time being. You may also want to consider using emptyDrops for identifying cells from each sample; see the 10X workflow.
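One way to picture sample-specific QC is an outlier rule applied inside each batch rather than one global cutoff. The sketch below flags cells whose mitochondrial percentage is more than a few median absolute deviations (MADs) above their own batch's median. The 3-MAD cutoff, the unscaled MAD and the toy values are assumptions for illustration, not the exact rule used in the workflow.

```python
from statistics import median

def high_outliers(values, nmads=3.0):
    """Flag values lying more than nmads MADs above the median."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    threshold = med + nmads * mad
    return [v > threshold for v in values]

def per_batch_discard(mito_percent, batches, nmads=3.0):
    """Apply the outlier rule separately within each batch."""
    if len(mito_percent) != len(batches):
        raise ValueError("need one mitochondrial percentage and one batch label per cell")
    cells_in_batch = {}
    for index, batch in enumerate(batches):
        cells_in_batch.setdefault(batch, []).append(index)
    discard = [False] * len(batches)
    for indices in cells_in_batch.values():
        flags = high_outliers([mito_percent[i] for i in indices], nmads)
        for i, flag in zip(indices, flags):
            discard[i] = flag
    return discard

# Toy data: batch B has a higher baseline than batch A (hypothetical values).
mito_percent = [2, 10, 3, 11, 2.5, 12, 3, 10.5, 30, 11.5, 40]
batches = ["A", "B", "A", "B", "A", "B", "A", "B", "A", "B", "B"]

discard = per_batch_discard(mito_percent, batches)
print([i for i, flag in enumerate(discard) if flag])
```

Because each batch is judged against its own median, the batch with the higher baseline is not penalised as a whole, which is the point of doing QC per sample.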

A median of 30 reads (UMIs?) per cell is... pretty bad. Seems like a sample that would need more sequencing. Differences in coverage of that magnitude are unlikely to be well tolerated by any method; there's simply too much sampling noise at those kinds of counts.
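For a back-of-the-envelope feel for that sampling noise, the sketch below assumes pure Poisson sampling, which is a simplification, and a hypothetical gene that accounts for 1% of a cell's molecules. Under that assumption a count with mean m has a coefficient of variation of 1/sqrt(m) and is zero with probability exp(-m).

```python
import math

def poisson_noise(total_per_cell, gene_fraction):
    """Expected count, coefficient of variation and probability of a zero
    for one gene, assuming Poisson sampling (an assumption, not a fitted model)."""
    mean = total_per_cell * gene_fraction
    cv = 1.0 / math.sqrt(mean)
    p_zero = math.exp(-mean)
    return mean, cv, p_zero

# The two ends of the coverage range from the question; the 1% gene is hypothetical.
for total in (30, 3000):
    mean, cv, p_zero = poisson_noise(total, 0.01)
    print(f"total={total}: mean={mean:.2f}, CV={cv:.2f}, P(zero)={p_zero:.2f}")
```

At the low end, even a fairly abundant gene has an expected count below one, so most cells would report a zero for it.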