Dear all, dear Laleh and Aaron,
I have a "best practice" question regarding the mnnCorrect that you provide in scran. My setup is the following: We sampled single cells from three different time points across two sequencing platforms (3'-end and full-length cDNA). Unfortunately, the sequencing depth is magnitudes lower for the 3'-end sequencing, as well as there is one timepoint missing in the 3'-end sequencing.
All in all, when I run mnnCorrect on both (full) expression matrices (full-length with three timepoints, 3'-end with two timepoints), the PCA/t-SNE looks quite OK, but still shows a batch effect that gets exaggerated further by other downstream dimension reduction methods (like SOMs).
My question is: if I have known subpopulations that are matched in principle (e.g. by capture timepoint), would you recommend running mnnCorrect for each timepoint separately? Or do the returned expression values then become incomparable (different scales)?
Or is it a mixed effect due to mismatched sequencing depth / missing subpopulations? That would lead to the question of how to evaluate the mnnCorrect output beyond PCA and t-SNE.
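For reference, the call I'm running looks roughly like this (gene-by-cell log2-expression matrices, one per platform; object names are just placeholders):

    library(scran)
    # full.mat: full-length cDNA data (three timepoints)
    # tag.mat:  3'-end data (two timepoints)
    out <- mnnCorrect(full.mat, tag.mat)
    corrected <- out$corrected  # list of corrected matrices, one per batch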
Best and thanks a lot for the great article on bioRxiv!
Jens
Hi Aaron,
Thanks a lot for your reply. The time points are not homogeneous populations. We know this from analyzing the full-length cDNA sequencing data alone, and we expect the detected (sub-)subpopulations to be present in the 3'-end sequencing data as well. However, analysis of that data alone does not resolve the heterogeneity very clearly. We had hoped to use mnnCorrect to roughly align the data and then resolve the heterogeneity downstream (which has already worked for a different experiment with the same setup).
I used the latest stable version of scran (1.4.5).
Perhaps a more relevant question: is the sequencing technology the only difference between the batches? Were the cells sourced from the same population, and dissociated in the same manner? If so, the substructure within each time point should be the same across batches, and use of removeBatchEffect should be okay. If not, then you'll have to use mnnCorrect.

The most likely explanation for any poor performance is that the direction/magnitude of the batch effect differs across time points. This is very difficult to handle with batch correction methods, which need to make some assumptions about the nature of the batch effect in order to separate it from the biological effects. There are some ways around this, but it's not easy.
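Very roughly, and only as a sketch (object names below are illustrative, not taken from your data):

    library(limma)
    library(scran)
    # If substructure is the same across batches: remove an additive batch
    # effect from the combined log-expression matrix with limma.
    corrected <- removeBatchEffect(combined.logexprs, batch = platform)

    # Otherwise: MNN correction, with one log-expression matrix per batch.
    out <- mnnCorrect(fulllength.mat, tagseq.mat, k = 20)
    corrected.list <- out$corrected  # one corrected matrix per batch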
P.S. Respond to existing answers with "add comment" rather than adding your own answer.
Yes, the cells were sourced from the same population and dissociated in the same manner as well. I played around a bit and noticed that the normalization strategy might have an effect. Are there further recommendations for the mnnCorrect input, besides it being log2-transformed? Is the log2(counts) from scater's exprs slot OK, or should it rather be something from sum factor normalization / normalizeExprs?
I think in the paper we used library size normalization or computeSumFactors to compute size factors, and then computed log-transformed expression values via scater's normalize method.
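As a minimal sketch of that workflow (assuming your raw counts are in an object called sce; the name is just a placeholder):

    library(scran)
    library(scater)
    # Deconvolution size factors, then log-transformed normalized expression.
    sce <- computeSumFactors(sce)
    sce <- normalize(sce)
    logexprs <- exprs(sce)  # input to mnnCorrect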