As I understand it, applying size factors to raw counts with scater::normalize() is one type of between-sample normalization. Thus, the expression values for the same gene are comparable across samples/cells. However, size factors do not account for gene length the way TPM does, so they should not be used to compare expression levels between different genes, even within the same sample/cell. Is my understanding correct? If so, how can expression levels be compared between two genes, especially for droplet-based data where no TPM value is available for each gene?
Thanks for your quick and detailed reply. I think there is some misunderstanding here, as my question is more about visualisation than about a scientific question. Several single-cell papers show marker gene expression for each cluster like this:
Highly Parallel Genome-wide Expression Profiling of Individual Cells Using Nanoliter Droplets Figure 5D, Figure S5A
(https://marlin-prod.literatumonline.com/cms/attachment/f7ef631e-7b3d-4ce9-820c-eb86ef36d507/figs6_lrg.jpg) So I was wondering: from such plots, can we say that gene A has "higher expression" than gene B (on average) in cluster 1? From your answer, it seems that this is OK for UMI-based quantification but does not hold for read-based quantification. Do I understand correctly?
If gene A has higher UMI counts than gene B, the only thing you can say with certainty is that, well, gene A has more detected UMIs than gene B. If you want to say that A has more transcripts than B, you will need to assume that the capture (and PCR, and sequencing, and mapping) efficiencies are the same across genes. That's a pretty strong assumption, so I wouldn't do it. I won't even go into the varying biological interpretations of "higher expression".
To me, the main point of your linked figure is to compare expression of the same gene across clusters to identify the cell type corresponding to each cluster. I don't think there's any need or intent to compare genes within the same cluster. You might say that you could compare silent genes to genes with non-zero counts; but, even in the most obvious case where a gene is silent in a cluster, you'd want to see some non-zero expression in another cluster to assure yourself that the zero counts in the first cluster are due to lack of expression rather than systematic problems with capture/PCR/whatever.
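To make the argument above concrete, here is a toy sketch in plain Python. It assumes a deliberately simple model in which the expected UMI count is the true transcript number multiplied by a gene-specific efficiency (capture, PCR, sequencing and mapping lumped together). The gene names, cluster names, transcript numbers and efficiencies are all made up for illustration; nothing here comes from real data or from scater.

```python
# Toy model (an assumption, not a real protocol model):
#   expected UMIs = true transcripts x gene-specific efficiency
# All values below are hypothetical and chosen only for illustration.
true_transcripts = {
    "cluster1": {"geneA": 100, "geneB": 100},
    "cluster2": {"geneA": 20, "geneB": 100},
}
efficiency = {"geneA": 0.20, "geneB": 0.05}  # hypothetical per-gene efficiency


def expected_umis(cluster, gene):
    """Expected UMI count for a gene in a cluster under the toy model."""
    return true_transcripts[cluster][gene] * efficiency[gene]


# Between genes, within one cluster: both genes have the same number of
# transcripts, yet the UMI ratio reflects only the efficiency difference.
between_genes = expected_umis("cluster1", "geneA") / expected_umis("cluster1", "geneB")

# Within one gene, across clusters: the gene-specific efficiency cancels,
# so the UMI ratio equals the true transcript ratio.
within_gene = expected_umis("cluster1", "geneA") / expected_umis("cluster2", "geneA")
true_ratio = true_transcripts["cluster1"]["geneA"] / true_transcripts["cluster2"]["geneA"]

print("geneA vs geneB in cluster1 (UMI ratio):", between_genes)
print("geneA in cluster1 vs cluster2 (UMI ratio):", within_gene)
print("geneA in cluster1 vs cluster2 (true ratio):", true_ratio)
```

Under this model the between-gene ratio is driven entirely by the efficiencies, whereas the across-cluster ratio for a single gene matches the underlying transcript ratio. That is why the same-gene, across-cluster comparison is the safe one to read off such a figure.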
As for your final question, UMI count data are better than read-based data for counting transcripts, as the former avoids issues with transcript length. But UMIs still suffer from issues with capture efficiency and so on. I'd wager that some sort of normalization would be necessary to obtain accurate transcript counts from UMI data (especially if you want absolute counts, in which case you'd need a spike-in standard curve), but I've never done this as it has never been of interest to me.
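A rough sketch of the transcript length point, again in plain Python and under stated assumptions: for a read-based protocol with fragmentation, the expected read count is taken to scale with both the number of molecules and the transcript length, while each captured molecule contributes at most one UMI regardless of length. The function names, the `reads_per_kb` and `capture_efficiency` parameters, and all numbers are hypothetical.

```python
# Toy comparison of read-based and UMI-based counts for two genes that
# have the same number of molecules but different transcript lengths.
# Both models are simplifying assumptions made for illustration only.

def expected_reads(molecules, length_kb, reads_per_kb=10):
    """Read-based: longer transcripts yield more fragments, hence more reads."""
    return molecules * length_kb * reads_per_kb


def expected_umis(molecules, capture_efficiency=0.1):
    """UMI-based: one UMI per captured molecule, independent of length."""
    return molecules * capture_efficiency


molecules = 50                  # same for both hypothetical genes
short_kb, long_kb = 1, 4        # hypothetical transcript lengths in kb

read_ratio = expected_reads(molecules, long_kb) / expected_reads(molecules, short_kb)
umi_ratio = expected_umis(molecules) / expected_umis(molecules)

print("long/short gene, read counts:", read_ratio)
print("long/short gene, UMI counts:", umi_ratio)
```

Note that `capture_efficiency` is held constant across genes here purely to isolate the length effect. In practice it varies between genes, which is the remaining problem with UMI counts described above.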
Thanks a lot! Sometimes I also see people define whether a gene is expressed in a cell using TPM > 1 (read-based) or UMI count > 1 as a cutoff. A typical case is deciding whether a T cell is a CD4+ or CD8+ T cell. Since capture efficiency differs considerably between genes, I am not sure that this is a correct approach. What do you think?
Comparison to a hard threshold for binarization of expression values is a different problem to what you've described in your original question. Suffice to say that I don't think much of this approach, so scater and scran don't do it.