Dear all, dear Laleh and Aaron,
I have a "best practice" question regarding the mnnCorrect that you provide in scran. My setup is the following: We sampled single cells from three different time points across two sequencing platforms (3'-end and full-length cDNA). Unfortunately, the sequencing depth is magnitudes lower for the 3'-end sequencing, as well as there is one timepoint missing in the 3'-end sequencing.
All in all, when I run mnnCorrect on both (full) expression matrices (full-length with three timepoints, 3'-end with two timepoints), the PCA/t-SNE looks quite OK, but still shows a batch effect that gets exaggerated further by other downstream dimension reduction methods (like SOMs).
My question is: if I have known subpopulations that are matched in principle (e.g. by capture timepoint), would you recommend running mnnCorrect for each timepoint separately? Or do the returned expression values then become incomparable (different scales)?
Or is it a mixed effect due to mismatched sequencing depth / missing subpopulations? That would lead to the question of how to evaluate the mnnCorrect output beyond PCA and t-SNE.
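For reference, the call I'm running looks roughly like this (gene-by-cell log2-expression matrices, one per platform; object names are just placeholders):

    library(scran)
    # full.mat: full-length cDNA data (three timepoints)
    # tag.mat:  3'-end data (two timepoints)
    out <- mnnCorrect(full.mat, tag.mat)
    corrected <- out$corrected  # list of corrected matrices, one per batch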
Best and thanks a lot for the great article on bioRxiv!
Jens
Hi Aaron,
Thanks a lot for your reply. The time points are not homogeneous populations. We know this from analyzing the full-length cDNA sequencing data alone, and we expect the detected (sub-)subpopulations to be present in the 3'-end sequencing data as well. However, analysis of that data alone does not resolve the heterogeneity very clearly. We had hoped to use mnnCorrect to roughly align the data and then resolve the heterogeneity downstream (which has already worked for a different experiment with the same setup).
I used the latest stable version of scran (1.4.5).
Perhaps a more relevant question: is the sequencing technology the only difference between the batches? Were the cells sourced from the same population, and dissociated in the same manner? If so, the substructure within each time point should be the same across batches, and use of removeBatchEffect should be okay. If not, then you'll have to use mnnCorrect.

The most likely explanation for any poor performance is that the direction/magnitude of the batch effect differs across time points. This is very difficult to handle with batch correction methods, which need to make some assumptions about the nature of the batch effect in order to separate it from the biological effects. There are some ways around this, but it's not easy.
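Very roughly, and only as a sketch (object names below are illustrative, not taken from your data):

    library(limma)
    library(scran)
    # If substructure is the same across batches: remove an additive batch
    # effect from the combined log-expression matrix with limma.
    corrected <- removeBatchEffect(combined.logexprs, batch = platform)

    # Otherwise: MNN correction, with one log-expression matrix per batch.
    out <- mnnCorrect(fulllength.mat, tagseq.mat, k = 20)
    corrected.list <- out$corrected  # one corrected matrix per batch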
P.S. Respond to existing answers with "add comment" rather than adding your own answer.
Yes, the cells were sourced from the same population and dissociated in the same manner as well. I played around a bit and noticed that the normalization strategy might have an effect. Are there further recommendations for the mnnCorrect input, besides it being log2-transformed? Is the log2(counts) from scater's exprs slot OK, or should it rather be something from sum factor normalization / normalizeExprs?
I think in the paper we used library size normalization or computeSumFactors to compute size factors, and then computed log-transformed expression values via scater's normalize method.
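As a minimal sketch of that workflow (assuming your raw counts are in an object called sce; the name is just a placeholder):

    library(scran)
    library(scater)
    # Deconvolution size factors, then log-transformed normalized expression.
    sce <- computeSumFactors(sce)
    sce <- normalize(sce)
    logexprs <- exprs(sce)  # input to mnnCorrect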