Question

Using controlAmbience with multiple samples and conditions

jma1991 ▴ 70

I have snRNA-seq data from multiple samples and conditions. I used the MT genes as the control features to estimate the ambient contamination and performed a full analysis of the data, including MNN integration. I now want to clean up the expression matrix to aid manual cell type annotation. I was going to use the controlAmbience function; however, this returns a count matrix that wouldn't reflect the transformations introduced by the MNN integration. I would essentially have one expression matrix generated by the integration and another generated by the ambient correction. In this fairly common scenario, with multiple samples and conditions, how should one proceed? My thoughts were:

1. Ignore the apparent disparity: use the MNN integrated matrix for the dimensionality reduction/clustering and use the ambient corrected matrix for visualization of gene expression and interpretation.
2. Analyse each sample separately and produce an ambient corrected count matrix at the end. Use the corrected matrices as input to the data integration stage and proceed with the usual downstream analyses (e.g. dimensionality reduction, clustering, marker detection).

I'm leaning more toward the second option, but it requires some extra processing and I'm unsure whether I may have missed anything in my understanding of the application of the controlAmbience function.
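For readers unfamiliar with the control-feature idea, here is a minimal sketch of the arithmetic, applied to one cluster within one sample. It is plain Python, not the DropletUtils API: the function names `ambient_scale` and `remove_ambient`, the gene symbols, and the toy numbers are all hypothetical, and the simple subtract-and-floor step is an assumed simplification rather than what the package necessarily does.

```python
def ambient_scale(cluster_counts, ambient_profile, control_genes):
    """Scale factor at which the ambient profile explains all control-gene counts.

    Assumes the control genes (e.g. MT genes in nuclei) have no true expression,
    so every count observed for them comes from the ambient solution.
    """
    observed = sum(cluster_counts[g] for g in control_genes)
    expected = sum(ambient_profile[g] for g in control_genes)
    if expected == 0:
        raise ValueError("control genes are absent from the ambient profile")
    return observed / expected


def remove_ambient(cluster_counts, ambient_profile, control_genes):
    """Subtract the scaled ambient profile from each gene, flooring at zero."""
    scale = ambient_scale(cluster_counts, ambient_profile, control_genes)
    return {
        gene: max(count - scale * ambient_profile.get(gene, 0.0), 0.0)
        for gene, count in cluster_counts.items()
    }


# Toy data: ambient profile as proportions, and summed counts for one cluster
# in one sample. Repeat per sample when there are several.
ambient = {"MT-CO1": 0.5, "MT-ND1": 0.25, "ALB": 0.25, "GAD1": 0.0}
cluster = {"MT-CO1": 20, "MT-ND1": 10, "ALB": 12, "GAD1": 50}
cleaned = remove_ambient(cluster, ambient, ["MT-CO1", "MT-ND1"])
```

Because the scale is estimated separately for each sample, the cleaned matrices stay on the count scale and are not directly comparable to MNN-reconstructed values, which is the disparity described above.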

DropletUtils

updated 4.1 years ago by Aaron Lun ★ 29k • written 4.2 years ago by jma1991 ▴ 70

score 0 · Answer 1 · 2022-01-01

Personally, I would go with choice 1.

I don't think it's a problem that the ambient-corrected matrix is not what is used in MNN correction. If the ambient contamination differs between samples, hopefully MNN correction will eliminate it as part of the between-sample effect, so it won't affect the downstream steps. (And if it's the same across samples, then it doesn't matter.)

Conversely, it's not a problem that your visualization/interpretation is not using MNN-corrected values. I rarely use the reconstructed values myself for per-gene analyses, as they are difficult to interpret - who knows what the batch correction had to do to align cells from different samples. Rather, I look at the uncorrected expression values and block on sample; you can do the same, but after correcting for ambient contamination within each sample.
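One simple reading of "block on sample" for a per-gene summary is to average within each sample first and only then combine across samples, so that no single sample dominates. The sketch below illustrates that idea in plain Python; the function name `blocked_cluster_means` and the toy values are hypothetical, and this is not the implementation used by any Bioconductor package.

```python
from collections import defaultdict
from statistics import mean


def blocked_cluster_means(cells):
    """Average one gene's expression per cluster within each sample, then across samples.

    `cells` is an iterable of (sample, cluster, value) tuples, where `value` is
    the uncorrected (optionally ambient-adjusted) log-expression of the gene.
    """
    per_block = defaultdict(list)
    for sample, cluster, value in cells:
        per_block[(cluster, sample)].append(value)

    per_cluster = defaultdict(list)
    for (cluster, _sample), values in per_block.items():
        per_cluster[cluster].append(mean(values))  # one mean per sample

    # Each sample contributes equally, regardless of how many cells it has.
    return {c: mean(sample_means) for c, sample_means in per_cluster.items()}


cells = [
    ("s1", "A", 2.0), ("s1", "A", 4.0), ("s1", "B", 0.0),
    ("s2", "A", 6.0), ("s2", "B", 1.0), ("s2", "B", 3.0),
]
result = blocked_cluster_means(cells)
```

Since every comparison is made within a sample before combining, between-sample shifts do not need to be removed from the expression values themselves.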

On a side note: using mitochondrial genes as a control only works well if the snRNA-seq experiment itself was a bit wonky. The better your stripping, the less mitochondrial contamination you'll have, and the less stable your estimates of the ambient contamination will be. In the few snRNA-seq datasets I've looked at, I had a median of 0% mitochondrial coverage, so I don't know how general this approach would be - YMMV.
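A quick way to check whether a dataset falls into this situation is to compute the per-cell mitochondrial fraction and take its median for each sample. The following is a minimal sketch in plain Python; the function names are hypothetical, and the `"MT-"` prefix is an assumption that fits human gene symbols (other annotations, such as mouse, name these genes differently).

```python
from statistics import median


def mito_fraction(cell_counts, mito_prefix="MT-"):
    """Fraction of one cell's counts assigned to mitochondrial genes."""
    total = sum(cell_counts.values())
    if total == 0:
        return 0.0
    mito = sum(n for gene, n in cell_counts.items() if gene.startswith(mito_prefix))
    return mito / total


def median_mito_fraction(cells, mito_prefix="MT-"):
    """Median mitochondrial fraction across the cells of one sample."""
    return median(mito_fraction(c, mito_prefix) for c in cells)


# Toy nuclei: most carry no mitochondrial counts at all.
nuclei = [
    {"MT-CO1": 0, "GAD1": 10},
    {"MT-CO1": 1, "GAD1": 9},
    {"MT-ND1": 0, "ALB": 5},
]
sample_median = median_mito_fraction(nuclei)
```

A median at or near zero suggests there are too few control counts to estimate the ambient scaling reliably.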