Removing cell cycle genes in large data set integration
p.joshi ▴ 30
@pjoshi-22718
Germany

Hi,

This time I am trying to understand how to deal with the effect of the cell cycle when integrating datasets. My data consist of cells across developmental windows that I am integrating using fastMNN(). To mitigate the effect of cell cycle genes, I took the easier approach of removing the cell cycle genes from downstream analysis (Chp 16, OSCA tutorial), discarding a custom list of 750 genes. This didn't have much effect on the final integration/cluster assignment.

As there are genes that could still be involved in cell cycle or are affected by cell cycle, I still want to regress out the effect of cell cycle. I found this tutorial, which was helpful, but I have the following doubts about how to proceed.

1) Shall I process each individual dataset for the cell cycle effect and use the obtained "corrected" logcounts for integration?
2) Or can I use the output of multiBatchNorm(), perform the correction on that, and then use the corrected log counts for fastMNN()?
3) Also, how can I validate that the cell cycle gene effect was removed in the integrated data, as I cannot plot a PCA for integrated data?

Piyush

scran cyclone • 3.1k views
Aaron Lun ★ 28k
@alun
The city by the bay

As there are genes that could still be involved in cell cycle or are affected by cell cycle, I still want to regress out the effect of cell cycle.

Is this really something you want to do, or are you just doing it because you were told to do it?

I think that people are far, far, far too relaxed about using linear regression to remove uninteresting factors of variation in single-cell data. There is often a lack of appreciation of the underlying assumption - that the covariate(s) being regressed out are orthogonal to the underlying biology. For technical factors of variation, only the most well-designed experiments can reasonably claim to satisfy this assumption; for biological factors like cell cycle, stress response, etc., the default attitude should be that the assumption is violated unless shown otherwise.

Violations of this assumption lead to incorrect results in the - ahem - "corrected" expression values. No other way to say it. I have witnessed people blindly regressing away the batch effect when their batches were confounded with different biological states. This obviously wipes out the biological differences that they were interested in, leaving them wondering why their states disappeared. However, the opposite effect can also happen. It is entirely possible to introduce false positive differences via regression when the population composition differs across different levels of the covariate being regressed out. (This is, in fact, the original motivation for the use of other batch correction methods like MNN.)
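Both failure modes can be reproduced with a toy calculation. The sketch below is illustrative only: it assumes a single hypothetical gene, two made-up cell types "A" and "B", and "regression" reduced to subtracting each batch's mean (which is what regressing on a batch factor alone amounts to). The numbers are invented for the example.

```python
from statistics import mean

def regress_out_batch(values, batches):
    """Subtract each batch's mean, i.e. regress on the batch factor alone."""
    batch_means = {
        b: mean([v for v, bb in zip(values, batches) if bb == b])
        for b in set(batches)
    }
    return [v - batch_means[b] for v, b in zip(values, batches)]

# Hypothetical gene: expression 0 in cell type A, 10 in cell type B.
truth = {"A": 0.0, "B": 10.0}

# Case 1: no batch effect at all, but the batches differ in composition.
cell_types = ["A"] * 8 + ["B"] * 2 + ["A"] * 2 + ["B"] * 8
batches = [1] * 10 + [2] * 10
expression = [truth[t] for t in cell_types]
corrected = regress_out_batch(expression, batches)

def type_mean(cell_type, batch):
    """Mean corrected value of one cell type within one batch."""
    return mean([c for c, t, b in zip(corrected, cell_types, batches)
                 if t == cell_type and b == batch])

a_batch1 = type_mean("A", 1)  # batch 1 mean is 2, so 0 - 2
a_batch2 = type_mean("A", 2)  # batch 2 mean is 8, so 0 - 8
spurious_difference = a_batch1 - a_batch2  # identical cells now differ

# Case 2: batch fully confounded with cell type.
confounded_types = ["A"] * 5 + ["B"] * 5
confounded_batches = [1] * 5 + [2] * 5
confounded_expression = [truth[t] for t in confounded_types]
confounded_corrected = regress_out_batch(confounded_expression,
                                         confounded_batches)
# Every value is now 0: the A-versus-B difference has been regressed away.
```

In the first case the type A cells were identical before correction and differ by 6 units afterwards; in the second case the real difference between A and B disappears entirely.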

I speculate that people like using linear regression for single-cell data because (i) it's simple and so they don't have to think, and (ii) it's what they did for bulk RNA-seq and so they don't have to think. However, the bulk analogy is poor because we usually know all of the other factors of interest in a bulk experiment. If we used limma::removeBatchEffect, we would almost certainly supply design= to ensure that our factors of interest are not altered upon removal of the batch effect specified in batch=. This knowledge is not available in single-cell experiments and so there is little precedent to be drawn from our bulk procedures.
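To make the bulk analogy concrete, here is a toy sketch of why knowing the factor of interest matters. It is not limma's implementation: it assumes a hypothetical additive batch shift on a single gene and estimates that shift within each known group, as a simplified stand-in for fitting batch and the factor of interest jointly. All names and numbers are invented for illustration.

```python
from statistics import mean

# Hypothetical gene: group A expresses 0, group B expresses 10,
# and batch 2 adds an assumed additive shift of 3 to every sample.
truth = {"A": 0.0, "B": 10.0}
batch_shift = 3.0

groups = ["A"] * 8 + ["B"] * 2 + ["A"] * 2 + ["B"] * 8
batches = [1] * 10 + [2] * 10
observed = [truth[g] + (batch_shift if b == 2 else 0.0)
            for g, b in zip(groups, batches)]

def subset_mean(batch, group=None):
    """Mean observed value in a batch, optionally restricted to one group."""
    return mean([y for y, g, b in zip(observed, groups, batches)
                 if b == batch and (group is None or g == group)])

# Ignoring the groups: the batch difference absorbs the composition difference.
naive_shift = subset_mean(2) - subset_mean(1)

# Using the known groups: compare batches within each group, then average.
aware_shift = mean([subset_mean(2, g) - subset_mean(1, g)
                    for g in sorted(set(groups))])

# Removing the group-aware estimate recovers the true expression values.
corrected = [y - (aware_shift if b == 2 else 0.0)
             for y, b in zip(observed, batches)]
```

Here the naive estimate of the batch shift is 9 while the group-aware estimate is the true value of 3. The group-aware version is only possible because the group labels are known up front, which is exactly the information that is usually missing in a single-cell experiment.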

In your case, do you truly believe that cell cycle is independent of the other biological processes happening in your dataset? You're talking about development here, and I would be extremely surprised if there wasn't some correlation between cell cycle and other interesting things like differentiation. All it takes is for a few lineages to exhibit more or less proliferation than the others and you can kiss goodbye to the above assumption.

Your argument is that cycling might affect genes other than the cell cycle genes, in which case removal of the latter is not enough. However, if they're not cell cycle genes, and their expression is altered... is this really just a cell cycle effect anymore? Or is it another process that just happens to be correlated with the cell cycle? If you can answer this question with "yes, it is still a cell cycle effect" (followed by a "I also cannot be bothered to repeat the analysis after removing those extra sort-of-but-not-cell-cycle genes"), only then may you have a case for using regression.

You mention that you got similar results before and after removing cell cycle genes - this sounds like good news to me, in that your conclusions are robust to any differences in cycling activity. IMO regression should be a procedure of last resort that - for any reasonably heterogeneous dataset - raises as many questions as it answers.

1) Shall I process each individual dataset for the cell cycle effect and use the obtained "corrected" logcounts for integration?
2) Or can I use the output of multiBatchNorm(), perform the correction on that, and then use the corrected log counts for fastMNN()?

If you must do this, I would suggest:

1. Running multiBatchNorm().
2. Running regressBatches() on each SCE with the covariates you want to regress out in design=.
3. Running fastMNN() on the corrected matrices, probably with cos.norm=FALSE.

It's difficult to predict what will crawl out the end of this procedure, but there you have it.

3) Also, how can I validate that the cell cycle gene effect was removed in the integrated data, as I cannot plot a PCA for integrated data?

I don't know how you were demonstrating that the cell cycle effect was there in the first place, but it seems you could just do the same thing on the low-dimensional result after correction. If you need some PCs after step 2, you can just use multiBatchPCA() or its simpler cousin scater::runPCA().


Is this really something you want to do, or are you just doing it because you were told to do it?

You could say that. In my untested belief, the cell cycle gene effect is still biologically important for in vivo data. Probably it can be discounted in in vitro experiments. The reason I asked the question is because of contradictory suggestions (I think :D) in the two links I mentioned.

I have witnessed people blindly regressing away the batch effect when their batches were confounded with different biological states

I couldn't agree more, but right now I am one of those people. However, my complication is this. I am trying to integrate experimental data gathered over developmental time with various experimental artifacts including: a) person who did the experiment changed from one experiment to other, b) the depth of sequencing changed, c) the instrument of sequencing changed, d) the characteristics of the tissue donor changed; among others. Also some experiments have biological replicates. Now I can investigate each replicate to see if they really have batch effect that should worry me; how would I go about correcting them for some while not for others? How will I also differentiate between batch effects and real biological differences without prior knowledge of biology (as it is lacking in the field, not that I am not reading papers)? I am also using fastMNN for correction but the approach in both the sources I linked suggest a linear regression for cell cycle effect.

You're talking about development here, and I would be extremely surprised if there wasn't some correlation between cell cycle and other interesting things like differentiation.

Once again I agree with you. The second complication is that I am trying to integrate some tumor data with this developmental data (at single cell level). Now in the integration the proliferating cell types are clustering together with some tumor populations. So I thought to remove the cell cycle genes, but that still didn't change the clustering, which as you suggest could be good news. But being cautious, I also wanted to test whether regressing out the effect, rather than excluding the cell cycle genes, could produce different results. What that would mean, and whether it would be a true result, is debatable, but I just wanted to see what happens.

From your response, I feel I should not proceed with regression; but thanks for suggesting a way to do it!

Piyush


Probably it can be discounted in in vitro experiments.

Well, if you ever had the opportunity to do scRNA-seq of HeLa cell cultures, you'll find that their main biological activity bar none is proliferation. Same for in vitro stimulation of T cells.

The reason I asked the question is because of contradictory suggestions (I think :D) in the two links I mentioned.

I wrote both of them, so I can probably comment on that. The link that describes how to do regression was written a long time ago when everyone thought it was a good idea. You can tell how I feel about it because it was one of the few sections in the simpleSingleCell workflow that I didn't bother migrating to the book - though I didn't have quite the heart to delete it, either, so here we are.

However, my complication is this. I am trying to integrate experimental data gathered over developmental time with various experimental artifacts including: a) person who did the experiment changed from one experiment to other, b) the depth of sequencing changed, c) the instrument of sequencing changed, d) the characteristics of the tissue donor changed; among others.

I have a technical term for your situation, but saying it would be in violation of this site's code of conduct.

Also some experiments have biological replicates. Now I can investigate each replicate to see if they really have batch effect that should worry me; how would I go about correcting them for some while not for others?

Check out the hierarchical merge approach described in the pancreas chapter. Also consider reading the mouse gastrulation paper in which they use MNN to merge biological replicates and samples across different (but still closely related) developmental timeframes.

How will I also differentiate between batch effects and real biological differences without prior knowledge of biology (as it is lacking in the field, not that I am not reading papers)?

Welcome to single-cell data analysis.

I am also using fastMNN for correction but the approach in both the sources I linked suggest a linear regression for cell cycle effect.

Cell cycle and batch effects are different things. The former occurs between cells within a batch, and the latter occurs between batches. This motivates the use of different strategies to tackle them.

The second complication is that I am trying to integrate some tumor data with this developmental data (at single cell level). Now in the integration the proliferating cell types are clustering together with some tumor populations.

If you just want to assign developmental labels to your tumor data, consider using SingleR to annotate the tumor data with the developmental data as the reference. This is a more direct approach to annotation where the algorithm can focus on the differences between labels to improve the accuracy of the mapping.
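The general idea of reference-based annotation can be sketched with a toy example. This is not SingleR's implementation or API: it only assumes that each query cell is assigned the reference label whose expression profile it correlates with best (by Spearman correlation). The gene values and the labels "progenitor" and "neuron" are hypothetical.

```python
def ranks(values):
    """Rank values from 0 upward (no tie handling; the toy data has no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result

def spearman(x, y):
    """Spearman correlation for tie-free data via the rank-difference formula."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical reference: mean expression of five genes per developmental label.
reference = {
    "progenitor": [9.0, 7.5, 1.0, 0.5, 3.0],
    "neuron": [1.0, 0.5, 8.0, 9.5, 3.5],
}

# Hypothetical tumor cell measured on the same five genes.
tumor_cell = [8.0, 6.0, 2.0, 1.5, 4.0]

# Score the cell against every label and keep the best-correlated one.
scores = {label: spearman(tumor_cell, profile)
          for label, profile in reference.items()}
best_label = max(scores, key=scores.get)
```

Because the cell is compared against labelled profiles directly, no joint clustering of tumor and developmental cells is required, so a shared proliferation signal has less opportunity to drive the assignment.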


I have a technical term for your situation, but saying it would be in violation of this site's code of conduct.

I hope it is not as bad as I am assuming.

I did notice you wrote both of the notes; that's why I asked you here. The "linear model based cell cycle correction" page was updated a month ago, which left me confused about which opinion is more recent. Without digressing further, thank you for your clarification.

Also thanks for the SingleR reference, I will check this out for sure.