I am not sure whether I should use the original logcounts assay or the batch-corrected values in the reconstructed assay.
Use the `logcounts`. Marker detection inside `trainSingleR()` (called by the main `SingleR()` function) is done using differential expression, and it is generally not a good idea to use the corrected values in DE analyses (see the discussion at https://osca.bioconductor.org/integrating-datasets.html#using-corrected-values).
The obvious issue is that the log-normalized expression values will still have some kind of batch effect in them. One might consider blocking on the batch (or sample, I can't quite remember which) when detecting markers.
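For example, a minimal sketch of blocked marker detection with scran, assuming the reference is a SingleCellExperiment called `ref` with `label` and `batch` columns in its colData (both names are assumptions):

```r
library(scran)

# Pairwise DE between labels on the uncorrected log-expression values,
# blocking on the batch so that batch effects do not drive the comparisons.
de.out <- pairwiseWilcox(logcounts(ref), ref$label, block = ref$batch)

# Keep the top markers from each pairwise comparison; this nested list
# can later be passed to the genes= argument of SingleR().
markers <- getTopMarkers(de.out$statistics, de.out$pairs, n = 10)
```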
To speed up the annotation I would also like to use the aggregateReference function, but again I am unsure what effect averaging the count profiles from the uncorrected logcounts assay would have on annotation.
Should be fine as long as you do the aggregation within each batch. This preserves the batch effect in the reference dataset, which is a good thing if you want to match up your data (with its own batch effect) with the reference.
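Something along these lines should do it, assuming the same `ref` object as above (the `label` field in the output and the separator choice are assumptions; this is the same paste trick suggested further down):

```r
library(SingleR)

# Paste label and batch together so that aggregation never crosses batches;
# pick a separator that cannot occur in your label names.
label.batch <- paste(ref$label, ref$batch, sep = "@")

aggregated <- aggregateReference(ref, labels = label.batch)

# The aggregated object stores the pasted labels in its 'label' colData field;
# strip off the batch part so downstream annotation uses the original labels.
aggregated$label <- sub("@.*$", "", aggregated$label)
```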
There is an option to specify multiple references; splitting my data by the blocking factor and annotating the cells seems to work, but I am not sure how I can process the results to combine the same label annotation from different blocks (probably just take the average score).
This is done by `combineResults()`. We're still figuring out the best way of doing it, though this is generally intended for situations where you have much more diverse datasets from different labs, technologies, etc.
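Roughly speaking, and assuming `combineResults()` accepts a list of SingleR result DataFrames (the `test`/`ref` objects and their colData column names below are assumptions):

```r
library(SingleR)

# Annotate the test cells against each batch of the reference separately.
by.batch <- split(seq_len(ncol(ref)), ref$batch)
all.results <- lapply(by.batch, function(i) {
    SingleR(test = test, ref = ref[, i], labels = ref$label[i])
})

# Collapse the per-batch results into a single set of calls per cell.
combined <- combineResults(all.results)
table(combined$labels)
```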
tl;dr This reference dataset is somewhat complex, so I would dig into the sausage factory a little.
- Define your own markers with some combination of `scran::pairwiseWilcox()` (or `scran::pairwiseTTests()`, depending on what takes your fancy) and `scran::getTopMarkers()`. This allows you to block on sample/batch. The output of `getTopMarkers()` will be passed to the `genes=` argument of `SingleR()`.
- Aggregate the reference's log-expression values with `aggregateReference()`. You'll have to do a bit of a hack to get it to only aggregate within each label/batch combination: I would paste the label and the batch together, aggregate with those pasted labels, and then split them out again at the end.
- Use the aggregated references and the new markers in `SingleR()` (see the sketch after this list).
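Putting the last step together, and reusing the `markers` and `aggregated` objects from the sketches above (the object names, and a query SingleCellExperiment called `test`, are assumptions):

```r
library(SingleR)

# Annotate the query cells with the batch-aware markers and the
# per-batch aggregated reference profiles.
pred <- SingleR(
    test = test,
    ref = aggregated,
    labels = aggregated$label,
    genes = markers
)
table(pred$labels)
```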
We could probably expedite this by having a `block=` argument in `trainSingleR()` and `aggregateReference()` to support blocking in the reference. If you make an issue at https://github.com/LTLA/SingleR, I'll ponder it.