Mitigating cell cycle effect in scRNA-seq using a blocking factor or design matrix
Entering edit mode
s1437643 ▴ 10
Last seen 3.0 years ago

I have an obvious cell cycle effect in my scRNA-seq data (treatment versus control layout) which I would like to mitigate in my analysis. My plan was to use the cell cycle assignment scores (calculated by cyclone) as covariates in the removeBatchEffect function to create a 'corrected' expression matrix, but perform feature selection and marker detection on the 'uncorrected' expression matrix using the cell cycle assignments as blocking terms. However, I'm confused about whether to use a blocking factor (the cell cycle assignments) versus a design matrix of covariates (cell cycle assignment scores) in some of the downstream functions within the scran package. For example, the modelGeneVar function can take either a blocking factor or a design matrix. Would the blocking factor simply be a character vector of cell cycle assignments (e.g. G1, G2M, or S) and the design be the same matrix of covariates as supplied to the removeBatchEffect function? The findMarkers function can also take either a blocking factor or design matrix. Is there a reason to prefer one over the other?

scran • 676 views
Entering edit mode
Aaron Lun ★ 27k
Last seen 2 hours ago
The city by the bay

As a general rule, block= is always safer than design=. The former literally processes each block separately and combines the results, which allows us to handle differences in the mean-variance trend (in modelGeneVar()) or differences in variance between groups (in findMarkers()). The use of a design matrix causes these methods to switch to linear models, which makes more assumptions about how similar the different blocking levels are. Nonetheless, design= may be necessary in some cases, e.g., if you have a set-up where all cells in one cluster are in one blocking level and all cells in another cluster are in another level, it's not possible to compare them by using findMarkers() with block=. These points are discussed briefly in the documentation.

Honestly speaking, I have mixed feelings about regressing the cell cycle effect. It seemed like a good idea at the time, and everyone was doing it, so that's why I talked about it in the workflow. But I've become increasingly concerned that the cell cycle is not entirely orthogonal to biological processes of interest, and attempting to regress it out could cause more trouble than it's worth. For example, if one cell type cycles more actively than another cell type, or when we're talking about T cell activation, trying to regress out cell cycle effect could cripple your signal (or even worse, introduce spurious signal). I've also had some nagging doubts about the accuracy of cell type calls from cyclone() or related methods that are based on classifiers learnt from a single reference dataset - it's not hard to find situations where the test dataset involves different cell types that don't behave much like the reference w.r.t. cell cycle-associated genes.

If I had to do it, I would block on the phase assignments, but I'm starting to consider whether just tossing out all genes with annotated associations to the cell cycle would be a safer approach (e.g., based on all terms in GO:0007049, which pretty much covers anything that might be associated with the cell cycle). This won't get rid of unannotated genes that have expression correlated to the cell cycle, but any such effect is indistinguishable from the activity of a separate pathway that happens to be correlated to the cell cycle.

Entering edit mode

I am currently analyzing 2 different datasets and having much concern regarding cell cycle effect.

The first dataset models an ex vivo developmental process, which contains a lot of cycling cells. No matter which method I use to regress out the 'cell cycle' effect, it keeps being very obvious after any form of regression... which makes me think that any attempt using the current approaches might be suboptimal at doing so for this particular dataset, and may create more harm than good. And actually, as pointed out by Aaron, cell cycle is a very important biological process taking place in this developmental process. Until advised otherwise, I believe that the effect should not be regressed out in this case.

The other dataset looks at the adult/homeostatic state of the tissue. Few cells appear to be proliferating in each of the distinct population. So, cells primarily cluster by cell type, and then within each cell type, there are 'subclusters' of cycling cells. In this case, keeping cell cycle felt like it gives a bit of an articificial pattern to the shape of the clusters. I found the last point described by Aaron to be very effective to deal with this (removing cell cycle associated genes in the list of variable genes used for dimension reduction, from the list of genes in GO:0007049). However, I realised that quite a few genes present in this list (GO:0007049) were also known developmental regulators of my tissue of interest. To circumvent this, I substracted the genes that were in the GO terms of my tissue of interest. I have to say that this didn't change much the overall picture compared to taking all GO:0007049, but it feels safer...


Login before adding your answer.

Traffic: 297 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6