scran: Are log-count expression values comparable between different genes within a sample/cell?
heyao ▴ 30
@heyao-14543

As I understand it, applying size factors to raw counts with scater::normalize() is a form of between-sample normalization, so expression values for the same gene are comparable across samples/cells. However, the size factors do not account for gene length the way TPM does, so the normalized values should not be used to compare expression levels between different genes, even within the same sample/cell. Is my understanding correct? If so, how should I compare expression levels between two genes, especially for droplet-based data where no TPM value is available for each gene?

Aaron Lun ★ 28k
@alun

The immediate answer is that, for droplet-based data with UMIs, each count is already a proxy for the number of transcripts. This is because only one read is sequenced per RNA transcript molecule, independent of its length. So, you could get TPM by just computing CPM with calculateCPM. Of course, this interpretation is affected by gene-specific biases in reverse transcription, PCR, sequencing and mapping. These biases are not easily removed - for example, how much improvement in PCR efficiency does one expect from a shorter transcript? What about secondary structure in the RNA molecule that affects reverse transcription? (And no, UMIs don't fully solve the PCR biases, because a poorly amplified cDNA molecule still won't be detected.)
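To make the arithmetic concrete, here is a minimal Python sketch. It is not scater code: the gene names, counts and lengths are invented for illustration, and `cpm` and `tpm` are hypothetical helper functions. It computes CPM for one cell and, for contrast, the length-adjusted TPM that would be used for full-length read data. With UMI counts the length step is skipped, which is why CPM can stand in for TPM there.

```python
def cpm(counts):
    """Counts per million for one cell: rescale counts so they sum to 1e6."""
    total = sum(counts.values())
    return {gene: count / total * 1e6 for gene, count in counts.items()}


def tpm(counts, lengths_kb):
    """TPM for read-based data: divide by gene length first, then rescale to 1e6."""
    per_kb = {gene: counts[gene] / lengths_kb[gene] for gene in counts}
    total = sum(per_kb.values())
    return {gene: rate / total * 1e6 for gene, rate in per_kb.items()}


# Hypothetical counts for a single cell and hypothetical gene lengths (in kb).
counts = {"geneA": 30, "geneB": 10, "geneC": 60}
lengths_kb = {"geneA": 1.0, "geneB": 4.0, "geneC": 2.0}

# If the counts are UMIs, each detected molecule contributes one count
# regardless of length, so no length correction is applied.
umi_cpm = cpm(counts)

# If the same numbers were read counts from a full-length protocol,
# longer genes would collect more reads, hence the division by length.
read_tpm = tpm(counts, lengths_kb)

for gene in counts:
    print(gene, round(umi_cpm[gene]), round(read_tpm[gene]))
```

Neither calculation addresses the gene-specific biases in reverse transcription, PCR, sequencing and mapping mentioned above.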

The longer answer would be to suggest that you reconsider the scientific question you are trying to solve. Let's put aside the technical problems for the moment and imagine that you are able to successfully infer that gene A has "higher expression" (i.e., more transcript molecules) than gene B. Then what? This conclusion doesn't mean that A is more biochemically active than B. Nor does it mean that A's overall biological effect is larger than that of B - for example, if B is a transcription factor, even low expression may be very impactful. Certainly for cell type identification, lowly expressed marker genes are often more useful than constitutively expressed genes. So even if A > B, what would that actually mean?


Thanks for your quick and detailed reply. I think there is a misunderstanding here, since my question is about visualisation rather than about a scientific question. Several single-cell papers show marker gene expression for each cluster like this:

Highly Parallel Genome-wide Expression Profiling of Individual Cells Using Nanoliter Droplets, Figure 5D and Figure S5A

(https://marlin-prod.literatumonline.com/cms/attachment/f7ef631e-7b3d-4ce9-820c-eb86ef36d507/figs6_lrg.jpg) So I was wondering: from such plots, can we say that gene A has "higher expression" than gene B (on average) in cluster 1? From your answer, it seems that this is OK for UMI-based quantification but does not hold for read-based quantification. Do I understand that correctly?


If gene A has higher UMI counts than gene B, the only thing you can say with certainty is that, well, gene A has more detected UMIs than gene B. If you want to say that A has more transcripts than B, you will need to assume that the capture (and PCR, and sequencing, and mapping) efficiencies are the same across genes. That's a pretty strong assumption, so I wouldn't do it. I won't even go into the varying biological interpretations of "higher expression".
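A small worked example of why that assumption matters, as a Python sketch. The molecule numbers and per-gene detection efficiencies below are entirely hypothetical (real efficiencies are generally unknown); the only point is that unequal efficiencies can reverse the ranking seen in the UMI counts.

```python
# Hypothetical true transcript numbers per cell for two genes.
true_molecules = {"geneA": 100, "geneB": 50}

# Hypothetical overall detection efficiency per gene
# (capture, PCR, sequencing and mapping combined).
efficiency = {"geneA": 0.05, "geneB": 0.20}

# Expected number of detected UMIs = true molecules x detection efficiency.
expected_umis = {
    gene: true_molecules[gene] * efficiency[gene] for gene in true_molecules
}

more_transcripts = max(true_molecules, key=true_molecules.get)
more_umis = max(expected_umis, key=expected_umis.get)

print("more transcripts:", more_transcripts)
print("more detected UMIs:", more_umis)
```

In this made-up case gene A has twice as many transcripts as gene B, yet gene B is expected to yield twice as many UMIs.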

To me, the main point of your linked figure is to compare expression of the same gene across clusters to identify the cell type corresponding to each cluster. I don't think there's any need or intent to compare genes within the same cluster. You might say that you could compare silent genes to genes with non-zero counts; but, even in the most obvious case where a gene is silent in a cluster, you'd want to see some non-zero expression in another cluster to assure yourself that the zero counts in the first cluster are due to lack of expression rather than systematic problems with capture/PCR/whatever.
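The within-gene comparison described here can be sketched in a few lines of Python. The cluster labels and log-expression values are invented, and `mean_by_cluster` is a hypothetical helper, not a scater or scran function. The sketch averages one gene per cluster and then checks whether the gene is detected in at least one cluster before a zero elsewhere is read as lack of expression.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (cluster label, log-expression) pairs for ONE marker gene.
cells = [
    ("cluster1", 2.1),
    ("cluster1", 1.8),
    ("cluster2", 0.0),
    ("cluster2", 0.0),
    ("cluster3", 0.4),
]


def mean_by_cluster(cells):
    """Average the expression of a single gene within each cluster."""
    grouped = defaultdict(list)
    for cluster, value in cells:
        grouped[cluster].append(value)
    return {cluster: mean(values) for cluster, values in grouped.items()}


per_cluster = mean_by_cluster(cells)

# A zero average in one cluster is only interpretable as "not expressed"
# if the same gene is detected somewhere else in the data set.
detected_somewhere = any(avg > 0 for avg in per_cluster.values())
silent_clusters = [c for c, avg in per_cluster.items() if avg == 0]

print(per_cluster)
print("silent in:", silent_clusters, "| detected elsewhere:", detected_somewhere)
```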

As for your final question, UMI count data are better than read-based data for counting transcripts, as the former avoids issues with transcript length. But UMIs still suffer from issues with capture efficiency and so on. I'd wager that some sort of normalization would be necessary to obtain accurate transcript counts from UMI data (especially if you want absolute counts, in which case you'd need a spike-in standard curve), but I've never done this as it has never been of interest to me.


Thanks a lot! I also sometimes see people define whether a gene is expressed in a cell using a cutoff such as TPM > 1 (read-based) or UMI count > 1. A typical case is deciding whether a T cell is a CD4+ or CD8+ T cell. Since capture efficiency differs considerably between genes, I am not sure this is a correct approach. What do you think?


Comparison to a hard threshold for binarization of expression values is a different problem to what you've described in your original question. Suffice to say that I don't think much of this approach, so scater and scran don't do it. 
