Question: [MnnCorrect] In case of known subpopulations
Jens Preussner (Germany) wrote 3 months ago:

Dear all, dear Laleh and Aaron,

I have a "best practice" question regarding the mnnCorrect function that you provide in scran. My setup is the following: we sampled single cells from three different time points across two sequencing platforms (3'-end and full-length cDNA). Unfortunately, the sequencing depth is orders of magnitude lower for the 3'-end sequencing, and one time point is missing from the 3'-end data.

All in all, when I run mnnCorrect on both (full) expression matrices (full-length with three timepoints, 3'-end with two timepoints), the PCA/t-SNE looks quite ok, but still shows a batch effect that gets exaggerated even more when using other downstream dimension reduction methods (like SOMs).

My question is: if I have known subpopulations that are matched in principle (e.g. by capture time point), would you recommend running mnnCorrect for each time point separately? Or would the returned expression values no longer be comparable (different scales?)?

Or is it a mixed effect due to mismatched sequencing depth and missing subpopulations? That leads to the question of how to evaluate the mnnCorrect output beyond PCA and t-SNE.

Best and thanks a lot for the great article on bioRxiv!

Jens


Hi Aaron,

thanks a lot for your reply. The time points are not homogeneous populations. We know this from analyzing the full-length cDNA sequencing data alone and we expect the detected (sub)-subpopulations also in the 3'-end-sequencing data. However, analysis of this data alone does not resolve the heterogeneity very clearly. We had hoped to use mnnCorrect to roughly align the data and then resolve the heterogeneity downstream (which has worked for a different experiment with the same setup already).

I used the latest stable version of scran (1.4.5).

Comment written 3 months ago by Jens Preussner

Perhaps a more relevant question: is the sequencing technology the only difference between the batches? Were the cells sourced from the same population, and dissociated in the same manner? If so, the substructure within each time point should be the same across batches, and use of removeBatchEffect should be okay. If not, then you'll have to use mnnCorrect.

The most likely explanation for any poor performance is that the direction/magnitude of the batch effect is different across time points. This is very difficult to handle with batch correction methods, which need to make some assumptions about the manner of the batch effect in order to separate it from the biological effects. There are some ways around this - but it's not easy.

P.S. Respond to existing answers with "add comment" rather than adding your own answer.

Comment written 3 months ago (modified 3 months ago) by Aaron Lun

Yes, the cells were sourced from the same population and dissociated in the same manner as well. I played around a bit and noticed that the normalization strategy might have an effect. Are there further recommendations for the mnnCorrect input, besides being log2-transformed? Are the log2(counts) from scater's exprs slot OK, or should the input rather come from sum factor normalization / normalizeExprs?

Comment written 3 months ago by Jens Preussner

I think in the paper we used library size normalization or computeSumFactors to compute size factors, and then computed log-transformed expression values via scater's normalize method.
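
To make the arithmetic behind this suggestion concrete, here is a minimal sketch in plain Python of library size normalization followed by a log2 transformation. It only illustrates the calculation and is not the scater or scran implementation; the function names, the pseudo-count of 1 and the toy counts are assumptions made for this example.

```python
import math

def library_size_factors(counts):
    """Size factor per cell: library size divided by the mean library size.

    counts is a genes x cells table given as a list of rows (one row per gene).
    """
    n_cells = len(counts[0])
    lib_sizes = [sum(gene[j] for gene in counts) for j in range(n_cells)]
    mean_lib = sum(lib_sizes) / n_cells
    return [size / mean_lib for size in lib_sizes]

def log_normalize(counts, size_factors, pseudo_count=1.0):
    """Divide each count by its cell's size factor, then take log2(x + pseudo_count)."""
    return [
        [math.log2(c / sf + pseudo_count) for c, sf in zip(gene, size_factors)]
        for gene in counts
    ]

# Toy data (assumed): 3 genes x 3 cells, with very different sequencing depths.
counts = [
    [10, 20, 5],
    [0, 4, 1],
    [30, 56, 14],
]
size_factors = library_size_factors(counts)
log_exprs = log_normalize(counts, size_factors)
print(size_factors)
print(log_exprs[0])
```

In this toy example the first gene makes up the same fraction of every cell's library, so its normalized log-expression is identical across the three cells even though the raw counts differ fourfold.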

Comment written 3 months ago (modified 3 months ago) by Aaron Lun
Aaron Lun (Cambridge, United Kingdom) wrote 3 months ago:

The raison d'être of mnnCorrect is that it can remove batch effects between two or more data sets when the internal structure of each data set is not known; there only has to be some shared subpopulation between the data sets, but we don't have to know which cells are in the shared subpopulations a priori. Now, if each time point is a distinct homogeneous subpopulation, then there is no real need to use mnnCorrect to correct the batch effect between your two sequencing technologies. You can just use removeBatchEffect, passing a design matrix that accounts for the time point factor via the design argument and the sequencing technology factor via the batch argument.
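
The idea of removing a technology effect while protecting the time point effect can be sketched for a single gene as follows. This is an illustrative Python sketch of an additive linear model with two batches, not limma's removeBatchEffect code, which differs in details such as how the batch term is coded. The function name `remove_batch_keep_time`, the choice to shift cells onto a reference batch, and the toy values are all assumptions.

```python
from collections import defaultdict

def remove_batch_keep_time(values, time_points, batches, reference):
    """Fit y = time effect + beta * batch for one gene and subtract beta * batch.

    The least-squares beta is obtained by centring both the expression values
    and the batch indicator within each time point, then regressing one on
    the other. Only two batches are supported in this sketch.
    """
    # Indicator: 0 for the reference batch, 1 for the other batch.
    indicator = [0.0 if b == reference else 1.0 for b in batches]

    # Per time point: sum of values, sum of indicator, number of cells.
    totals = defaultdict(lambda: [0.0, 0.0, 0])
    for y, t, x in zip(values, time_points, indicator):
        totals[t][0] += y
        totals[t][1] += x
        totals[t][2] += 1

    num = den = 0.0
    for y, t, x in zip(values, time_points, indicator):
        sum_y, sum_x, n = totals[t]
        resid_y = y - sum_y / n
        resid_x = x - sum_x / n
        num += resid_x * resid_y
        den += resid_x * resid_x
    if den == 0.0:
        raise ValueError("batch is completely confounded with time point")

    beta = num / den
    corrected = [y - beta * x for y, x in zip(values, indicator)]
    return corrected, beta

# Toy data (assumed): full-length cells at three time points,
# 3'-end cells at two time points with a constant shift of +2.
values      = [1.0, 1.0, 3.0, 3.0, 5.0, 3.0, 5.0]
time_points = ["t1", "t1", "t2", "t2", "t3", "t1", "t2"]
batches     = ["full", "full", "full", "full", "full", "3p", "3p"]

corrected, beta = remove_batch_keep_time(values, time_points, batches, reference="full")
print(beta)
print(corrected)
```

The time point that is present in only one technology contributes nothing to the estimate of the shift, because its batch indicator has no variation after centring. The model also assumes a single shift shared by all time points, which is exactly the assumption that fails if the batch effect differs between time points.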

That said, you should still be able to use mnnCorrect in this setting; it just does more work than would be necessary if removeBatchEffect could have been used instead. Make sure that you run the function on the highly variable genes only, rather than using the full expression matrix. This improves the resolution of the biological structure by reducing the amount of stochastic noise during the detection of nearest neighbours. Also make sure that you're using the latest version of scran, as there have been some bug fixes to mnnCorrect.
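
For readers unfamiliar with the nearest-neighbour step, the following plain Python sketch shows how mutual nearest neighbour pairs could be found between two batches using only a chosen subset of genes. It illustrates the concept rather than scran's implementation: the function names, the cosine normalization step as written here, the toy cells and the hand-picked gene indices standing in for highly variable genes are assumptions, and no correction vectors are computed.

```python
import math

def cosine_normalize(cell):
    """Scale a cell's expression vector to unit length."""
    norm = math.sqrt(sum(v * v for v in cell))
    return [v / norm for v in cell] if norm > 0 else list(cell)

def nearest(query, targets, k):
    """Indices of the k cells in targets closest to query (Euclidean distance)."""
    order = sorted(range(len(targets)), key=lambda j: math.dist(query, targets[j]))
    return set(order[:k])

def mnn_pairs(batch1, batch2, genes, k=1):
    """Return (i, j) pairs where cell i of batch1 and cell j of batch2
    are each among the other's k nearest neighbours.

    batch1 and batch2 are lists of cells (one list of log-expression values
    per cell); genes lists the gene indices used for the search.
    """
    b1 = [cosine_normalize([cell[g] for g in genes]) for cell in batch1]
    b2 = [cosine_normalize([cell[g] for g in genes]) for cell in batch2]
    nn_1_to_2 = [nearest(cell, b2, k) for cell in b1]
    nn_2_to_1 = [nearest(cell, b1, k) for cell in b2]
    return [
        (i, j)
        for i in range(len(b1))
        for j in sorted(nn_1_to_2[i])
        if i in nn_2_to_1[j]
    ]

# Toy data (assumed): batch1 holds one cell of type A and one of type B;
# batch2 holds two shifted cells of type A only. Gene 2 is treated as noise.
batch1 = [[5.0, 1.0, 0.3], [1.0, 5.0, 2.9]]
batch2 = [[6.0, 2.0, 1.7], [5.5, 1.5, 0.2]]
informative_genes = [0, 1]  # stand-in for a set of highly variable genes

pairs = mnn_pairs(batch1, batch2, informative_genes, k=1)
print(pairs)
```

The type B cell in batch1 has a nearest neighbour in batch2, but that neighbour prefers the type A cell, so no mutual pair forms. This is how a subpopulation that is missing from one batch can be left out of the pairing.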
