I am not sure whether I should use the original logcounts assay or the batch-corrected values in the reconstructed assay.
Use the `logcounts`. Marker detection inside `trainSingleR()` (called by the main `SingleR()` function) is done using differential expression, and it is generally not a good idea to use the corrected values in DE analyses (see the discussion at https://osca.bioconductor.org/integrating-datasets.html#using-corrected-values).
The obvious issue is that the log-normalized expression values will still have some kind of batch effect in them. One might consider blocking on the batch (or sample, I can't quite remember which) when detecting markers.
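For example, a minimal sketch of blocked marker detection with scran, assuming the reference is a SingleCellExperiment called `ref` with `label` and `batch` columns in its colData (both names are assumptions):

```r
library(scran)

# Pairwise DE between labels on the uncorrected log-expression values,
# blocking on the batch so that batch effects do not drive the comparisons.
de.out <- pairwiseWilcox(logcounts(ref), ref$label, block = ref$batch)

# Keep the top markers from each pairwise comparison; this nested list
# can later be passed to the genes= argument of SingleR().
markers <- getTopMarkers(de.out$statistics, de.out$pairs, n = 10)
```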
To speed up the annotation I would also like to use the aggregateReference function, but again I am unsure what effect averaging the count profiles from the uncorrected logcounts assay would have on annotation.
Should be fine as long as you do the aggregation within each batch. This preserves the batch effect in the reference dataset, which is a good thing if you want to match up your data (with its own batch effect) with the reference.
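Something along these lines should do it, assuming the same `ref` object as above (the `label` field in the output and the separator choice are assumptions; this is the same paste trick suggested further down):

```r
library(SingleR)

# Paste label and batch together so that aggregation never crosses batches;
# pick a separator that cannot occur in your label names.
label.batch <- paste(ref$label, ref$batch, sep = "@")

aggregated <- aggregateReference(ref, labels = label.batch)

# The aggregated object stores the pasted labels in its 'label' colData field;
# strip off the batch part so downstream annotation uses the original labels.
aggregated$label <- sub("@.*$", "", aggregated$label)
```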
There is an option to specify multiple references; splitting my data by the blocking factor and annotating the cells seems to work, but I am not sure how I can process the results to combine the same label annotation from different blocks (probably just take the average score).
This is done by `combineResults()`. We're still figuring out the best way of doing it, though this is generally intended for situations where you have much more diverse datasets from different labs, technologies, etc.
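Roughly speaking, and assuming `combineResults()` accepts a list of SingleR result DataFrames (the `test`/`ref` objects and their colData column names below are assumptions):

```r
library(SingleR)

# Annotate the test cells against each batch of the reference separately.
by.batch <- split(seq_len(ncol(ref)), ref$batch)
all.results <- lapply(by.batch, function(i) {
    SingleR(test = test, ref = ref[, i], labels = ref$label[i])
})

# Collapse the per-batch results into a single set of calls per cell.
combined <- combineResults(all.results)
table(combined$labels)
```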
tl;dr This reference dataset is somewhat complex, so I would dig into the sausage factory a little.
- Define your own markers with some combination of `scran::pairwiseWilcox()` (or `scran::pairwiseTTests()`, depending on what takes your fancy) and `scran::getTopMarkers()`. This allows you to block on sample/batch. The output of `getTopMarkers()` will be passed to the `genes=` argument of `SingleR()`.
- Aggregate the reference's log-expression values with `aggregateReference()`. You'll have to do a bit of a hack to get it to only aggregate within each label/batch combination: I would paste the label and the batch together, aggregate with those pasted labels, and then split them out again at the end.
- Use the aggregated references and the new markers in `SingleR()` (see the sketch after this list).
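Putting the last step together, and reusing the `markers` and `aggregated` objects from the sketches above (the object names, and a query SingleCellExperiment called `test`, are assumptions):

```r
library(SingleR)

# Annotate the query cells with the batch-aware markers and the
# per-batch aggregated reference profiles.
pred <- SingleR(
    test = test,
    ref = aggregated,
    labels = aggregated$label,
    genes = markers
)
table(pred$labels)
```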
We could probably expedite this by having a `block=` argument in `trainSingleR()` and `aggregateReference()` to support blocking in the reference. If you make an issue at https://github.com/LTLA/SingleR, I'll ponder it.