Does the classification apply to any type of scRNA-seq assay (SMART-seq, CEL-seq, Drop-seq, etc.), or mainly to full-length protocols where we see a higher number of reads per gene (e.g., SMART-seq2)?
The classifier is based on ranks of expression within each cell, so it is robust to increases or decreases in the library size of that cell. The real question is whether this classifier is robust to changes in coverage between genes when you switch to different technologies. For example, gene A may have higher counts than gene B in full-length protocols, but may have fewer counts in 3'-based protocols. I don't have any real idea of how badly the classifier is affected by such differences, but the original paper did see decent performance for a range of datasets generated from different scRNA-seq protocols, so it's probably okay.
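To make the rank argument concrete, here is a minimal sketch in plain Python (not R, and not the classifier itself). The counts are made up, and "deeper sequencing" is idealised as multiplying every count by the same factor. It only shows that within-cell ranks survive a uniform change in library size, but not a gene-specific change in coverage.

```python
def ranks(values):
    """Return 1-based ranks of the values; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out

# Hypothetical counts for five genes in one cell.
cell = [0, 12, 3, 3, 40]

# Idealised change in library size: every gene scaled by the same factor.
deeper = [3 * x for x in cell]

# Hypothetical protocol switch: only the second gene loses coverage.
other_protocol = [0, 2, 3, 3, 40]

ranks_cell = ranks(cell)
ranks_deeper = ranks(deeper)
ranks_other = ranks(other_protocol)

print(ranks_cell == ranks_deeper)  # uniform scaling leaves ranks unchanged
print(ranks_cell == ranks_other)   # gene-specific coverage changes do not
```

The second comparison is the case to worry about when moving between full-length and 3'-based protocols.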
If that doesn't work out for you, another cheap approach would be to get a list of marker genes for each phase and to simply perform Wilcoxon tests between each pair of phases using each individual cell's expression profile. Each cell is assigned to the phase where its genes have the highest expression. The challenge becomes where to get these markers - it's hard to find a curated reference source, and KEGG is less than helpful. GO provides some genes for G1 (GO:0000080), S (GO:0000084), G1/S (GO:0000082) and G2/M (GO:0000086), so this might be a good place to start. Of course, you could just pull some markers from some random experiment, as is done by cyclone. However, I find this rather unappealing, and I would have thought that we would have better reference annotation for such a well-studied biological process.
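One possible reading of the pairwise idea, sketched in plain Python. The gene names, marker sets and expression values are all hypothetical, and the voting rule (each pairwise comparison gives one vote to the phase whose markers tend to be higher) is my assumption about how to turn the pairwise comparisons into a single call. It uses the Mann-Whitney U statistic scaled to [0, 1] rather than a p-value.

```python
from itertools import combinations

def prob_greater(a, b):
    """Scaled Mann-Whitney U: chance that a value from a exceeds one from b (ties count half)."""
    wins = 0.0
    for x in a:
        for y in b:
            if x > y:
                wins += 1.0
            elif x == y:
                wins += 0.5
    return wins / (len(a) * len(b))

def assign_phase_pairwise(expr, markers):
    """Compare marker sets pairwise within one cell; the phase with most wins is assigned."""
    votes = {phase: 0 for phase in markers}
    for p, q in combinations(markers, 2):
        a = [expr[g] for g in markers[p]]
        b = [expr[g] for g in markers[q]]
        u = prob_greater(a, b)
        if u > 0.5:
            votes[p] += 1
        elif u < 0.5:
            votes[q] += 1
    # If two phases tie on votes, max() returns whichever was listed first.
    return max(votes, key=votes.get), votes

# Hypothetical marker sets and a single cell's expression profile.
markers = {
    "G1": ["g1_a", "g1_b", "g1_c"],
    "S": ["s_a", "s_b", "s_c"],
    "G2M": ["g2m_a", "g2m_b", "g2m_c"],
}
expr = {
    "g1_a": 1.0, "g1_b": 0.5, "g1_c": 2.0,
    "s_a": 5.0, "s_b": 6.5, "s_c": 4.0,
    "g2m_a": 2.5, "g2m_b": 3.0, "g2m_c": 1.5,
}

phase, votes = assign_phase_pairwise(expr, markers)
print(phase, votes)
```

With real data you would want many more markers per phase, and some rule for cells where no phase clearly wins.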
Just curious about "simply perform Wilcoxon tests between each pair of phases using each individual cell's expression profile. Each cell is assigned to the phase where its genes have the highest expression."
How do you do the test within each individual cell? Is that a simple and effective way to quantify a cell state defined by a gene module, like ssGSEA or singscore?
My version of this idea came from limma::wilcoxGST, which compares the mean rank of genes in the set with the mean rank of genes outside the set. You can do this for each phase and then pick the phase with the highest score. (Technically this is not quite the same as a Wilcoxon test between phases, but it should be pretty close.) I think the AUCell package uses this approach or something very much like it, though I would have to look at its supporting documentation more closely to be sure.
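A rough sketch of the mean-rank score, again in plain Python with hypothetical gene names and values. This is not a reimplementation of limma::wilcoxGST or AUCell. It only computes "mean rank inside the set minus mean rank outside the set" for each phase within one cell, then picks the phase with the highest score.

```python
def rank_genes(expr):
    """Map each gene to its 1-based rank within the cell; ties share the average rank."""
    genes = sorted(expr, key=expr.get)
    ranks = {}
    i = 0
    while i < len(genes):
        j = i
        while j + 1 < len(genes) and expr[genes[j + 1]] == expr[genes[i]]:
            j += 1
        for g in genes[i:j + 1]:
            ranks[g] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def mean_rank_score(expr, gene_set):
    """Mean rank of genes in the set minus mean rank of all other genes in the cell."""
    ranks = rank_genes(expr)
    members = set(gene_set)
    inside = [r for g, r in ranks.items() if g in members]
    outside = [r for g, r in ranks.items() if g not in members]
    return sum(inside) / len(inside) - sum(outside) / len(outside)

def assign_phase(expr, markers):
    """Score every phase in one cell and return the best phase with all scores."""
    scores = {phase: mean_rank_score(expr, genes) for phase, genes in markers.items()}
    return max(scores, key=scores.get), scores

# Hypothetical marker sets; "bg_*" genes are background genes in no set.
markers = {
    "G1": ["g1_a", "g1_b", "g1_c"],
    "S": ["s_a", "s_b", "s_c"],
    "G2M": ["g2m_a", "g2m_b", "g2m_c"],
}
expr = {
    "g1_a": 1.0, "g1_b": 0.5, "g1_c": 2.0,
    "s_a": 5.0, "s_b": 6.5, "s_c": 4.0,
    "g2m_a": 2.5, "g2m_b": 3.0, "g2m_c": 1.5,
    "bg_a": 0.0, "bg_b": 3.5, "bg_c": 7.0,
}

phase, scores = assign_phase(expr, markers)
print(phase, scores)
```

Applying `assign_phase` to each cell in turn gives one phase call per cell. The scores are only compared between gene sets in the same cell.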
I should point out that all of these approaches are competitive gene set tests, which has some implications for interpretation of the results. The most obvious example is when you have two cells A and B that perform process X to the same level of "activity" (i.e., the genes involved in X have the same distribution of expression in A and B). However, cell B also performs process Y at a higher level of activity than X. Because we're doing competitive tests, the gene set score for X in B is lower than that in A, simply because Y is pushing down the ranks for X in B. This might lead one to say that B is doing less of X than A, but that's not true.
In this specific case of phase assignment, the above scenario is not a problem because we assume that the cell must be in one of G1, S or G2/M. We are thus only comparing between gene sets in the same cell, which is a valid application for competitive methods. We are never comparing gene set scores between cells, which would be much riskier. The same arguments apply for other types of assignment like cell type, where we assume that cells must be one of the choices.
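The caveat about competitive tests can be reproduced with a toy example. This is a sketch in plain Python with invented genes and values, using the same "mean rank inside minus mean rank outside" score as an assumed stand-in for a competitive gene set statistic. The X genes have identical expression in cells A and B, but B also expresses the Y genes strongly.

```python
def rank_genes(expr):
    """Map each gene to its 1-based rank within the cell; ties share the average rank."""
    genes = sorted(expr, key=expr.get)
    ranks = {}
    i = 0
    while i < len(genes):
        j = i
        while j + 1 < len(genes) and expr[genes[j + 1]] == expr[genes[i]]:
            j += 1
        for g in genes[i:j + 1]:
            ranks[g] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def competitive_score(expr, gene_set):
    """Mean rank of the set minus mean rank of every other gene in the same cell."""
    ranks = rank_genes(expr)
    members = set(gene_set)
    inside = [r for g, r in ranks.items() if g in members]
    outside = [r for g, r in ranks.items() if g not in members]
    return sum(inside) / len(inside) - sum(outside) / len(outside)

set_x = ["x1", "x2", "x3"]
set_y = ["y1", "y2", "y3"]

# Hypothetical profiles: X is identical in both cells, Y is only high in B.
cell_a = {"x1": 5.0, "x2": 6.0, "x3": 7.0,
          "y1": 1.0, "y2": 2.0, "y3": 3.0,
          "o1": 0.5, "o2": 1.5, "o3": 2.5}
cell_b = {"x1": 5.0, "x2": 6.0, "x3": 7.0,
          "y1": 8.0, "y2": 9.0, "y3": 10.0,
          "o1": 0.5, "o2": 1.5, "o3": 2.5}

x_in_a = competitive_score(cell_a, set_x)
x_in_b = competitive_score(cell_b, set_x)
y_in_b = competitive_score(cell_b, set_y)

# Between cells: X scores lower in B although its expression is unchanged.
print(x_in_a, x_in_b)
# Within cell B: Y scores higher than X, which is a fair statement about B.
print(y_in_b, x_in_b)
```

The between-cell comparison is the misleading one. The within-cell comparison is the kind used for phase assignment.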
Incidentally, if you're using GO, the G1 marker genes are annotated under "G1/S transition". I guess that makes sense because if any genes are promoting or repressing the transition, they must have been expressed during G1. Similarly, S phase genes are annotated under "mitotic DNA replication". So the marker genes are there, just not in a particularly obvious place.
Thank you, Aaron! I am going over the scRNA-seq tutorials again; thank you for adding the tutorial on differential expression!