Some questions about removing batch effects using mnnCorrect
xingxd16

Hi all:

  • I used 10X single-cell gene expression to profile my samples. I have 9 samples, but when I merge them for analysis I see an obvious batch effect driven by sample. In particular, T cells cluster by sample rather than by cell type. So I want to use the mnnCorrect function in the scran package to remove the batch effect and make cells cluster by cell type rather than by sample.

  • Before using mnnCorrect, I have some questions. My 9 samples were processed one at a time, on different days, each with its own 10X kit, so I think I have 9 batches across all my cells. The differences between sample cells are caused not only by batch but also by biological variation.

  • My question is: can mnnCorrect remove the batch effect in my situation? If so, which parameters should I pay attention to and change to achieve better results? If not, any other advice for this situation?

  • Thanks a lot. Best

batch effect single cells mnnCorrect scran
Aaron Lun

Yes, you can use mnnCorrect, or its faster and usually nicer-looking sibling fastMNN. Some instructions on the latter are provided here. I don't see any obvious problem with using MNN in the situation you've mentioned, so you'll just have to try it and see if it works.
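A minimal sketch of such a fastMNN call (the object name `sce` and the per-cell label `sce$sample` are assumptions for illustration, not from the thread):

```r
# Hedged sketch: fastMNN() from the batchelor package on a combined dataset.
# 'sce' is a hypothetical SingleCellExperiment holding cells from all 9
# samples, with the sample of origin stored per cell in sce$sample.
library(batchelor)

out <- fastMNN(sce, batch = sce$sample)

# The MNN-corrected low-dimensional coordinates, suitable for clustering
# and visualization:
corrected <- reducedDim(out, "corrected")
```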

While we're on this topic: MNN will happily remove biological differences between samples. This is not a bug, but a feature. To give an example - the compareSingleCell workflow uses fastMNN to merge wild-type and knock-out samples together prior to downstream analysis, i.e., clustering and annotation of clusters. If we tried to preserve biological differences between samples, the wild-type and knock-out cells would never cluster together, as they would be separated in their expression profiles by the big effect of the knock-out. This is "biologically accurate" but would defeat the purpose of doing merging in the first place, because now we need to cluster and annotate each genotype separately. Differential abundance analyses would become silly - "why yes, the abundance of the knock-out mesoderm cluster increases in the knock-out mice" - and trying to match up clusters between genotypes is not a pleasant experience in developmental settings where the clusters are not distinct.

By getting rid of the differences between samples, we can establish a common annotation that allows us to more easily compare cell types/states between samples. Once this is established, you can then go back to the original expression values to do a pseudo-bulk DE analysis (see one of the later vignettes in the workflow) to recover the differences between samples. And of course, if you don't fully trust the batch correction, you can simply cluster each sample individually. In doing so, you can take advantage of the hard work that you did in setting up the common annotation to guide your per-sample annotation - then you don't have to re-annotate everything from scratch, you only have to worry about big discrepancies from the common annotation.
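The pseudo-bulk idea mentioned above can be sketched roughly as follows, using only base R for the aggregation step (all object names here are hypothetical; the summed profiles would then go into a standard bulk DE tool such as edgeR, as in the workflow):

```r
# Hedged sketch of building pseudo-bulk profiles after a common annotation
# has been established. 'counts_mat' is a hypothetical genes-by-cells count
# matrix; 'sample_id' and 'cluster_id' are hypothetical per-cell labels.

# Sum raw counts over all cells sharing the same sample/cluster combination.
groups <- paste(sample_id, cluster_id, sep = ".")
pseudo_bulk <- t(rowsum(t(as.matrix(counts_mat)), group = groups))

# Each column of 'pseudo_bulk' is now one sample-by-cluster profile; these
# can be analysed like ordinary bulk RNA-seq libraries (e.g., with edgeR),
# comparing samples within each cluster to recover between-sample differences.
```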

I've found that people get upset when I tell them to remove biological differences between samples. But the alternative is to have, e.g., all cells from each patient clustering separately, which makes the merge useless.


Thanks a lot, Aaron!!! I'm still confused about a few things.

  • (1) What's the difference between mnnCorrect and fastMNN? From the name, I assume fastMNN is faster than mnnCorrect. My data is very big; can you give me an example showing how to use the BPPARAM argument (e.g., BPPARAM=MulticoreParam() instead of the default SerialParam()) to speed up the calculation with multiple cores?
  • (2) I can extract the batch-corrected data from the mnnCorrect result with mnnCorrect$corrected[[1]], mnnCorrect$corrected[[2]], etc., and cbind them together for the downstream analysis. How do I get this corrected matrix from fastMNN?
  • (3) After I get the batch-corrected data, can I run PCA on it directly? Or do I have to center the data first with scale?
  • (4) I have samples from three conditions: tumor-adjacent tissue (8 samples), primary tumor (15 samples), and tumor from patients after drug treatment (9 samples). As mentioned above, each sample is one batch. Should I run mnnCorrect on all the samples one by one and then cbind them together, or is it better to correct within each condition separately first and then combine the three conditions later? You are an expert in this field; any advice for my situation? If I just put these samples together, I see a very obvious batch effect, which is why I have to correct for batch before clustering.
  • (5) The over-fitting problem with mnnCorrect. I have tried other methods that mix the samples perfectly, but I think they over-correct. Most evidence shows that immune cells cluster together by cell type, while tumor-specific malignant cells cluster by sample and show strong heterogeneity. But those methods mix the tumor cells as well. Can mnnCorrect preserve this biological variation? How can I make the immune cells cluster by cell type while the tumor cells cluster by sample?
  • (6) Is the strong batch effect caused by the variable genes selected? Usually we select highly variable genes first, but I think many of these genes may be driven by sample-specific effects. If I exclude these genes, will the batch effect be removed? The same for PCs: if I exclude the PCs specific to each sample, the cells should no longer cluster by sample. Am I right?
  • Hoping for your reply. Best
  1. Read the documentation in ?fastMNN; I'm not going to repeat it here.
  2. Read the documentation in the workflow.
  3. If you're using fastMNN, you'll get low-dimensional coordinates as output, so there's no need for another PCA step. Read the documentation, etc.
  4. I would merge samples from the same condition first, and then merge across conditions. See the hierarchical merge instructions and justification.
  5. See my comments above for removal of biological differences. What's the point in having separate clusters of tumor cells? There's no point in merging the samples if you're just going to look at each cluster separately. Alternatively, if you just want to look at immune cells, then extract out the immune cells from each sample (e.g., "gating" based on CD45) and do the merge on those instead. It's unfair to ask the merge algorithm to merge together tumor cells and then complain when they get merged!
  6. It is possible to reduce the batch effect by removing genes that are variable across samples. However, this may also remove relevant biological differences within a batch, if the population composition differs between batches. For example, if two batches contain T/B cells, and one batch contains mostly T cells and the other batch contains mostly B cells, then CD3/CD19/etc. will be variable across batches; but removing those genes will also remove heterogeneity within the batch. The same applies for PCs; it is tempting to remove PCs that are correlated with batch, but this has the opportunity to remove structure within the batch if the population composition varies. (And if the population composition does not vary, then there is no need for complicated batch correction methods - just use something simple like batchelor::rescaleBatches.)
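Points 1, 3 and 4 of the answer above might look roughly like this in code. This is a sketch only: the three-condition grouping and sample counts come from the question, but the sample names, the SingleCellExperiment `sce`, its per-cell label `sce$sample`, and the worker count are all assumptions; `merge.order` and `BPPARAM` are real fastMNN arguments.

```r
# Hedged sketch of a hierarchical merge with parallelization via fastMNN().
library(batchelor)
library(BiocParallel)

# Hypothetical sample names for the three conditions from the question:
adjacent <- paste0("adj", 1:8)    # tumor-adjacent tissue
primary  <- paste0("tum", 1:15)   # primary tumor
treated  <- paste0("trt", 1:9)    # post-treatment tumor

out <- fastMNN(sce,
    batch = sce$sample,
    # Merge samples within each condition first, then across conditions:
    merge.order = list(adjacent, primary, treated),
    # Use multiple cores instead of the default SerialParam():
    BPPARAM = MulticoreParam(workers = 4)
)

# fastMNN() already returns corrected low-dimensional coordinates, so no
# additional PCA/scaling step is needed before clustering:
corrected <- reducedDim(out, "corrected")
```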

Sincere thanks, Aaron. Best wishes.

