As a general rule, block=
is always safer than design=
. The former literally processes each block separately and combines the results, which allows us to handle differences in the mean-variance trend (in modelGeneVar()
) or differences in variance between groups (in findMarkers()
). The use of a design matrix causes these methods to switch to linear models, which makes more assumptions about how similar the different blocking levels are. Nonetheless, design=
may be necessary in some cases, e.g., if you have a set-up where all cells in one cluster are in one blocking level and all cells in another cluster are in another level, it's not possible to compare them by using findMarkers()
with block=
. These points are discussed briefly in the documentation.
Honestly speaking, I have mixed feelings about regressing the cell cycle effect. It seemed like a good idea at the time, and everyone was doing it, so that's why I talked about it in the workflow. But I've become increasingly concerned that the cell cycle is not entirely orthogonal to biological processes of interest, and attempting to regress it out could cause more trouble than it's worth. For example, if one cell type cycles more actively than another cell type, or when we're talking about T cell activation, trying to regress out cell cycle effect could cripple your signal (or even worse, introduce spurious signal). I've also had some nagging doubts about the accuracy of cell type calls from cyclone()
or related methods that are based on classifiers learnt from a single reference dataset - it's not hard to find situations where the test dataset involves different cell types that don't behave much like the reference w.r.t. cell cycle-associated genes.
If I had to do it, I would block on the phase assignments, but I'm starting to consider whether just tossing out all genes with annotated associations to the cell cycle would be a safer approach (e.g., based on all terms in GO:0007049
, which pretty much covers anything that might be associated with the cell cycle). This won't get rid of unannotated genes that have expression correlated to the cell cycle, but any such effect is indistinguishable from the activity of a separate pathway that happens to be correlated to the cell cycle.
I am currently analyzing 2 different datasets and having much concern regarding cell cycle effect.
The first dataset models an ex vivo developmental process, which contains a lot of cycling cells. No matter which method I use to regress out the 'cell cycle' effect, it keeps being very obvious after any form of regression... which makes me think that any attempt using the current approaches might be suboptimal at doing so for this particular dataset, and may create more harm than good. And actually, as pointed out by Aaron, cell cycle is a very important biological process taking place in this developmental process. Until advised otherwise, I believe that the effect should not be regressed out in this case.
The other dataset looks at the adult/homeostatic state of the tissue. Few cells appear to be proliferating in each of the distinct population. So, cells primarily cluster by cell type, and then within each cell type, there are 'subclusters' of cycling cells. In this case, keeping cell cycle felt like it gives a bit of an articificial pattern to the shape of the clusters. I found the last point described by Aaron to be very effective to deal with this (removing cell cycle associated genes in the list of variable genes used for dimension reduction, from the list of genes in GO:0007049). However, I realised that quite a few genes present in this list (GO:0007049) were also known developmental regulators of my tissue of interest. To circumvent this, I substracted the genes that were in the GO terms of my tissue of interest. I have to say that this didn't change much the overall picture compared to taking all GO:0007049, but it feels safer...