Cell cycle regression for scRNA-seq data
ATpoint ▴ 860

My scRNA-seq data (10X, murine, hematopoietic cells) have the problem that some clusters are separated almost exclusively by cell cycle, which is not interesting for the scenario we are working with and only inflates the number of clusters. This can be shown both with a PCA run on cell cycle genes (the separation is obvious in PC1 vs PC2 for some clusters) and with the cyclone cell cycle assignment as in book chapter 16.4. I would therefore like to remove the effect, e.g. as in book chapter 16.5. Removing the cell cycle genes from the selected features is not sufficient and makes hardly any difference, so I am looking for a more aggressive strategy. Following chapter 16.4, I am not clear on the exact workflow from there on. Do we run regressBatches on the original logcounts and then repeat the feature selection, integration and clustering procedure? Also, is there something similar in the Bioconductor world as in the last chapter of the Seurat vignette where not the cell cycle effect itself but the difference between the G2M and S phase scores is regressed?

Thanks for your suggestions!

OSCA batchelor cell cycle regression • 576 views
Aaron Lun ★ 27k

Do we run regressBatches on the original logcounts and then repeat the feature selection, integration and clustering procedure?

The latest version of the chapter has a bit more information available. Briefly, the regression just applies to the log-values you feed into the PCA. Clustering picks up from the PCs, so it doesn't need extra regression. And feature selection can use block= to ensure that cell cycle differences do not drive the detection of HVGs.
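The order of operations described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the data are simulated Poisson counts, `phase` is a hypothetical factor standing in for cyclone() assignments, and the cutoffs (50 HVGs, 10 PCs) are arbitrary. Only regressBatches() and the block= argument are taken from the answer itself.

```r
library(SingleCellExperiment)
library(scuttle)
library(scran)
library(scater)
library(batchelor)

set.seed(100)

# Toy data (assumption): 500 genes x 120 cells of Poisson counts.
ngenes <- 500
ncells <- 120
mu <- 2^runif(ngenes, 0, 6)
counts <- matrix(rpois(ngenes * ncells, lambda = mu), nrow = ngenes,
    dimnames = list(paste0("Gene", seq_len(ngenes)),
                    paste0("Cell", seq_len(ncells))))
sce <- SingleCellExperiment(assays = list(counts = counts))
sce <- logNormCounts(sce)

# Hypothetical phase labels; in practice these would come from cyclone().
phase <- factor(sample(c("G1", "S", "G2M"), ncells, replace = TRUE))

# 1. Feature selection, blocking on phase so that phase differences
#    do not inflate the variance estimates.
dec <- modelGeneVar(sce, block = phase)
hvgs <- getTopHVGs(dec, n = 50)

# 2. Regress the phase out of the log-values only.
reg <- regressBatches(sce, batch = phase)

# 3. PCA on the corrected values, restricted to the HVGs.
pcs <- calculatePCA(assay(reg, "corrected"), subset_row = hvgs,
    ncomponents = 10)
reducedDim(reg, "PCA") <- pcs

# 4. Clustering picks up from the PCs; no further regression is applied.
g <- buildSNNGraph(reg, use.dimred = "PCA")
clusters <- igraph::cluster_walktrap(g)$membership
```

Note that nothing downstream of the PCA touches the corrected assay again, which is the point made in the answer.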

I take it you've read and understood my comments on the potential problems from using regression, so I won't repeat them here. I will just say that I would still prefer gene removal as this is more predictable and less liable to introduce artifacts - see the new version of the chapter for a more aggressive empirical version of this approach.

Also, is there something similar in the Bioconductor world as in the last chapter of the Seurat vignette where not the cell cycle effect itself but the difference between the G2M and S phase scores is regressed?

Sure, if you've got a covariate, just make a design matrix and give it to design= in regressBatches(). (Similarly, you can give it to design= in functions like findMarkers().) You can put anything in there, e.g., the cyclone() phase scores or the SingleR() correlations. However, I have come to wonder whether this hurts more than it helps; the magnitude of the scores is probably even more sensitive to confounding differences in the biological state.
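As a rough sketch of the design= route for the "G2M minus S" idea: the code below builds a one-covariate design matrix from the difference of two per-cell scores and passes it to regressBatches(). The data and the `scores` data frame are simulated stand-ins (assumptions) for real log-counts and for the `scores` element returned by cyclone(). cyclone() scores are not computed the same way as Seurat's phase scores, so this is an analogue rather than a reproduction of the Seurat procedure, and it assumes a batchelor version that accepts design= without batch=.

```r
library(SingleCellExperiment)
library(scuttle)
library(batchelor)

set.seed(42)

# Toy data (assumption): 200 genes x 80 cells of Poisson counts.
ngenes <- 200
ncells <- 80
mu <- 2^runif(ngenes, 0, 6)
counts <- matrix(rpois(ngenes * ncells, lambda = mu), nrow = ngenes,
    dimnames = list(paste0("Gene", seq_len(ngenes)),
                    paste0("Cell", seq_len(ncells))))
sce <- SingleCellExperiment(assays = list(counts = counts))
sce <- logNormCounts(sce)

# Hypothetical per-cell scores standing in for cyclone()$scores.
scores <- data.frame(S = runif(ncells), G2M = runif(ncells))

# Regress on the difference between the two scores rather than on phase.
cc.diff <- scores$G2M - scores$S
design <- model.matrix(~cc.diff)

reg <- regressBatches(sce, design = design)

# The corrected values are residuals, so they should be uncorrelated
# with the covariate that was regressed out.
corrected <- as.matrix(assay(reg, "corrected"))
max.abs <- max(abs(corrected %*% cc.diff))
```

The same `design` object could then be reused for design= in findMarkers(), as mentioned above.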

Thanks Aaron for the extensive comment, very helpful as usual!

I just noticed that setting design= in regressBatches() actually also requires you to give it something like batch=integer(ncol(sce)) to keep the function happy. (It doesn't matter what the exact value is, you just have to give it something to let it move on to the next step.) I've updated the function in BioC-devel so that it no longer needs batch= if you give it design=.
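For anyone unsure which behaviour their installed version has, one hedged way to write the call is to try design= alone first and fall back to the dummy batch= only if that fails. The toy data and the `covariate` vector below are assumptions for illustration; the fallback simply uses the placeholder suggested above.

```r
library(SingleCellExperiment)
library(scuttle)
library(batchelor)

set.seed(1)

# Toy data (assumption): 100 genes x 40 cells of Poisson counts.
ngenes <- 100
ncells <- 40
mu <- 2^runif(ngenes, 0, 6)
counts <- matrix(rpois(ngenes * ncells, lambda = mu), nrow = ngenes,
    dimnames = list(paste0("Gene", seq_len(ngenes)),
                    paste0("Cell", seq_len(ncells))))
sce <- SingleCellExperiment(assays = list(counts = counts))
sce <- logNormCounts(sce)

# Hypothetical continuous covariate to regress out.
covariate <- runif(ncells)
design <- model.matrix(~covariate)

# Try the newer form first; if this version still insists on batch=,
# retry with the placeholder value described in the comment above.
reg <- tryCatch(
    regressBatches(sce, design = design),
    error = function(e) {
        regressBatches(sce, batch = integer(ncol(sce)), design = design)
    }
)
```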

One suggestion to add to the book:

For context, this is the UMAP in my case. All clusters above 0 in UMAP2 are almost exclusively in G1 and those below 0 are in mixed states. In my case it was not sufficient simply to remove annotated cell cycle genes (such as those from the GO term) from the HVGs that go into the clustering. In fact, the results were almost identical.

Instead, it turned out that removing all genes upregulated in the "upper" clusters vs the "lower" clusters did the trick and completely removed this separation while preserving the expected cell type separation. I used pairwiseWilcox for it, setting lfc relatively high (log2(1.5)) to capture only the strong "drivers" of that phenomenon. Of the roughly 100 genes that were removed from clustering, only about half were listed in that GO term; the rest were mainly histone genes and genes related to RNA processing and nucleotide metabolism. That is not unexpected for G1-phase cells, I guess, but they are still not in that GO term.
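A minimal sketch of what this empirical removal might look like is given below. The expression matrix, the "upper"/"lower" grouping, the planted driver genes, the HVG set and the 5% FDR cutoff are all assumptions made up for illustration; only pairwiseWilcox() and lfc = log2(1.5) come from the description above.

```r
library(scran)

set.seed(7)

# Toy log-expression matrix (assumption): 300 genes x 100 cells.
ngenes <- 300
ncells <- 100
logexpr <- matrix(rnorm(ngenes * ncells, mean = 3), nrow = ngenes,
    dimnames = list(paste0("Gene", seq_len(ngenes)),
                    paste0("Cell", seq_len(ncells))))

# Hypothetical grouping of cells into the two sets of clusters.
group <- rep(c("upper", "lower"), each = ncells / 2)

# Plant 20 strongly upregulated "driver" genes in the upper group.
logexpr[1:20, group == "upper"] <- logexpr[1:20, group == "upper"] + 3

# One-sided test for genes that are higher in one group by more than
# the log-fold change threshold.
res <- pairwiseWilcox(logexpr, groups = group,
    direction = "up", lfc = log2(1.5))

# Pick the upper-vs-lower comparison out of the pairwise results.
idx <- which(res$pairs$first == "upper" & res$pairs$second == "lower")
stats <- res$statistics[[idx]]
drivers <- rownames(stats)[!is.na(stats$FDR) & stats$FDR <= 0.05]

# Drop the drivers from a hypothetical HVG set before PCA and clustering.
hvgs <- rownames(logexpr)[1:100]
hvgs.clean <- setdiff(hvgs, drivers)
```

With real data, `group` would be derived from the cluster labels and `hvgs` from the usual feature selection step.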

Maybe this is something you can include in the book as an additional exploration strategy if removal of cell cycle genes alone is not sufficient. In my case, regression completely messed up the data and eliminated any reasonable cell type separation, so your advice to apply regression with care is definitely a good one.

Thanks again for your advice!
