Question

Removing cell cycle genes in large data set integration

1

Entering edit mode

p.joshi ▴ 40

@pjoshi-22718

Last seen 3.7 years ago

Germany

Hi,

This time I am trying to understand how can I deal with effect of cell cycle in integrating datasets. My data consist of cell across developmental windows that I am integrating using fastMNN(). To mitigate the effect of cell cycle genes, I used an easier approach of removing the cell cycle genes from downstream analysis (Chp 16, OSCA tutorial), which didn't have much effect on the final integration/cluster assignment. I used a custom list of 750 genes that were discarded from analysis.

As there are genes that could still be involved in cell cycle or are affected by cell cycle, I still want to regress out the effect of cell cycle. I found this tutorial which was helpful, but I have following doubts how to proceed.

1) Shall I process individual dataset for cell cycle effect and used the obtained "corrected" logcounts for integration? 2) Or can I use the output of multiBatchNorm() and perform correction on that and then use the corrected log counts for fastMNN? 3) Also how can I validate that the cell cycle gene effect was removed in the integrated data, as I cannot plot a PCA for integrated data?

Piyush

scran cyclone • 7.5k views

ADD COMMENT • link updated 5.8 years ago by Aaron Lun ★ 29k • written 5.8 years ago by p.joshi ▴ 40

score 5 · Accepted Answer · 2020-05-24

As there are genes that could still be involved in cell cycle or are affected by cell cycle, I still want to regress out the effect of cell cycle.

Is this really something you want to do, or are you just doing it because you were told to do it?

I think that people are far, far, far too relaxed about using linear regression to remove uninteresting factors of variation in single-cell data. There is often a lack of appreciation of the underlying assumption - that the covariate(s) being regressed out are orthogonal to the underlying biology. For technical factors of variation, only the most well-designed experiments can reasonably claim to satisfy this assumption; for biological factors like cell cycle, stress response, etc., the default attitude should be that the assumption is violated unless shown otherwise.

Violations of this assumption lead to incorrect results in the - ahem - "corrected" expression values. No other way to say it. I have witnessed people blindly regressing away the batch effect when their batches were confounded with different biological states. This obviously wipes out the biological differences that they were interested in, leaving them wondering why their states disappeared. However, the opposite effect can also happen. It is entirely possible to introduce false positive differences via regression when the population composition differs across different levels of the covariate being regressed out. (This is, in fact, the original motivation for the use of other batch correction methods like MNN.)

I speculate that people like using linear regression for single-cell data because (i) it's simple and so they don't have to think, and (ii) it's what they did for bulk RNA-seq and so they don't have to think. However, the bulk analogy is poor because we usually know all of the other factors of interest in a bulk experiment. If we used limma::removeBatchEffect, we would almost certainly supply design= to ensure that our factors of interest are not altered upon removal of the batch effect specified in batch=. This knowledge is not available in single-cell experiments and so there is little precedent to be drawn from our bulk procedures.

In your case, do you truly believe that cell cycle is independent of the other biological processes happening in your dataset? You're talking about development here, and I would be extremely surprised if there wasn't some correlation between cell cycle and other interesting things like differentiation. All it takes is for a few lineages to exhibit more or less proliferation than the others and you can kiss goodbye to the above assumption.

Your argument is that cycling might affect genes other than the cell cycle genes, in which case removal of the latter is not enough. However, if they're not cell cycle genes, and their expression is altered... is this really just a cell cycle effect anymore? Or is it another process that just happens to be correlated with the cell cycle? If you can answer this question with "yes, it is still a cell cycle effect" (followed by a "I also cannot be bothered to repeat the analysis after removing those extra sort-of-but-not-cell-cycle genes"), only then may you have a case for using regression.

You mention that you got similar results before and after removing cell cycle genes - this sounds like good news to me, in that your conclusions are robust to any differences in cycling activity. IMO regression should be a procedure of last resort that - for any reasonably heterogeneous dataset - raises as many questions as it answers.

1) Shall I process individual dataset for cell cycle effect and used the obtained "corrected" logcounts for integration? 2) Or can I use the output of multiBatchNorm() and perform correction on that and then use the corrected log counts for fastMNN?

If you must do this, I would suggest:

Running multiBatchNorm().
Running regressBatches() on each SCE with the covariates you want to regress out in design=.
Running fastMNN() on the corrected matrices, probably with cos.norm=FALSE.

It's difficult to predict what will crawl out the end of this procedure, but there you have it.

3) Also how can I validate that the cell cycle gene effect was removed in the integrated data, as I cannot plot a PCA for integrated data?

I don't know how you were demonstrating that the cell cycle effect was there in the first place, but it seems you could just do the same thing on the low-dimensional result after correction. If you need some PCs after step 2, you can just use multiBatchPCA() or its simpler cousin scater::runPCA().