Question

Cell cycle regression for scRNA-seq data

ATpoint ★ 4.0k

My scRNA-seq data (10X, murine, hematopoietic cells) have the problem that some clusters are separated almost exclusively by cell cycle, which is not interesting for the scenario we are working with and only inflates the number of clusters. This shows up both in a PCA run on cell cycle genes (for some clusters the separation is obvious in PC1 vs PC2) and in the cyclone cell cycle assignment as in the book, chapter 16.4. I would therefore like to remove the effect, e.g. as in the book, chapter 16.5. Removing the cell cycle genes from the selected features is not sufficient and does not really make any difference, so I am looking for a more aggressive strategy.

Following chapter 16.4, I am not clear on the exact workflow from there on. Do we run regressBatches on the original logcounts and then repeat the feature selection, integration and clustering procedure? Also, is there something similar in the Bioconductor world as in the last chapter of the Seurat vignette where not the cell cycle effect itself but the difference between the G2M and S phase scores is regressed?

Thanks for your suggestions!

OSCA batchelor cell cycle regression • 2.4k views

updated 3.7 years ago by Aaron Lun ★ 28k • written 3.7 years ago by ATpoint ★ 4.0k

score 2 · Accepted Answer · 2020-08-18

Do we run regressBatches on the original logcounts and then repeat the feature selection, integration and clustering procedure?

The latest version of the chapter has a bit more information available. Briefly, the regression just applies to the log-values you feed into the PCA. Clustering picks up from the PCs, so it doesn't need extra regression. And feature selection can use block= to ensure that cell cycle differences do not drive the detection of HVGs.
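To make the division of labour concrete, here is a minimal conceptual sketch in base R on simulated data. It is not the batchelor or scran implementation: the matrix `logexpr`, the `phase` labels and the helper `phase_r2` are made up for illustration, and `lm.fit()` residuals merely stand in for what a per-gene regression on phase does. It shows the two ideas from the paragraph above: a blocked variance for feature selection, and regression applied only to the log-values that go into the PCA.

```r
# Conceptual sketch in base R on simulated data; all names are hypothetical.
set.seed(1)
n_genes <- 50
n_cells <- 300
phase <- factor(sample(c("G1", "S", "G2M"), n_cells, replace = TRUE))

# Simulated log-expression (genes x cells): unit noise everywhere,
# plus a phase-dependent shift in the first 10 "cell cycle" genes.
logexpr <- matrix(rnorm(n_genes * n_cells), nrow = n_genes)
shift <- c(G1 = 0, S = 2, G2M = 4)[as.character(phase)]
logexpr[1:10, ] <- sweep(logexpr[1:10, ], 2, shift, "+")

# Feature selection with blocking: average the within-phase variances so
# that between-phase differences do not inflate a gene's variance.
within_var <- sapply(levels(phase), function(p) {
  apply(logexpr[, phase == p, drop = FALSE], 1, var)
})
blocked_var <- rowMeans(within_var)
total_var <- apply(logexpr, 1, var)

# Regression: fit one mean per phase for every gene and keep the residuals.
design <- model.matrix(~ 0 + phase)
regressed <- t(lm.fit(design, t(logexpr))$residuals)

# PCA on the original and on the regressed log-values.
pca_before <- prcomp(t(logexpr))
pca_after <- prcomp(t(regressed))

# Fraction of the variance in a PC that is explained by phase.
phase_r2 <- function(pc) summary(lm(pc ~ phase))$r.squared
r2_before <- phase_r2(pca_before$x[, 1])
r2_after <- phase_r2(pca_after$x[, 1])
```

In this toy setting `r2_before` is large and `r2_after` is numerically zero, and the first 10 genes have a much smaller `blocked_var` than `total_var`. Anything downstream that works from the PCs of `regressed` (here `pca_after$x`) needs no further correction, which is the point made above about clustering.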

I take it you've read and understood my comments on the potential problems from using regression, so I won't repeat them here. I will just say that I would still prefer gene removal as this is more predictable and less liable to introduce artifacts - see the new version of the chapter for a more aggressive empirical version of this approach.

Also, is there something similar in the Bioconductor world as in the last chapter of the Seurat vignette where not the cell cycle effect itself but the difference between the G2M and S phase scores is regressed?

Sure, if you've got a covariate, just make a design matrix and give it to design= in regressBatches(). (Similarly, you can give it to design= in functions like findMarkers().) You can put anything in there, e.g., the cyclone() phase scores or the SingleR() correlations. However, I have come to wonder whether this hurts more than it helps; the magnitude of the scores is probably even more sensitive to confounding differences in the biological state.
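As a rough illustration of the design-matrix idea, the following base R sketch regresses a continuous covariate, the difference between G2M and S scores, out of simulated log-expression values. Everything here is an assumption made for the example: `s_score` and `g2m_score` are random numbers standing in for real phase scores, and `lm.fit()` residuals are used as a stand-in for the regression step. With batchelor, the same `design` matrix would be what gets passed to design= in regressBatches().

```r
# Conceptual sketch in base R on simulated data; all names are hypothetical.
set.seed(2)
n_genes <- 20
n_cells <- 200

# Made-up per-cell phase scores standing in for the output of a classifier.
s_score <- runif(n_cells)
g2m_score <- runif(n_cells)
cc_diff <- g2m_score - s_score

# Simulated log-expression (genes x cells); the first 5 genes track cc_diff.
logexpr <- matrix(rnorm(n_genes * n_cells), nrow = n_genes)
logexpr[1:5, ] <- logexpr[1:5, ] + outer(rep(3, 5), cc_diff)

# Design matrix with an intercept and the covariate of interest.
design <- model.matrix(~ cc_diff)

# Per-gene linear fit; the residuals are the covariate-regressed values.
regressed <- t(lm.fit(design, t(logexpr))$residuals)

# Correlation of an affected gene with the covariate, before and after.
cor_before <- cor(logexpr[1, ], cc_diff)
cor_after <- cor(regressed[1, ], cc_diff)
```

Because the design includes an intercept, the residuals are also centred per gene. Any other per-cell covariate can be swapped in for `cc_diff` by changing the formula given to `model.matrix()`.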