Does countsFromAbundance="lengthScaledTPM" produce un-normalized counts?
Christian • 0

Hi everyone,

The DESeq2 documentation states that input needs to be un-normalized counts (=raw counts?), while tximport suggests for salmon data to apply countsFromAbundance="lengthScaledTPM" and use the result as a regular count matrix.

  1. To my understanding, tximport implies that its length-scaled counts can be treated like un-normalized counts in DESeq2. But why? Is it because these bias-corrected counts from tximport are different from normalized or transformed counts?
  2. Are there any reasons to prefer importing raw read counts (with offset) over countsFromAbundance="lengthScaledTPM" or can they be used likewise?
  3. As the outcome of countsFromAbundance="lengthScaledTPM" is corrected on a similar scale as TPMs: Could I use the data likewise, e.g., to compare counts between different genes within the same sample? And after transformation via rlog/vst would the values be on a scale that could be compared between genes AND between samples? Basically TPM would not be required anymore?

I'm grateful for any clarification.

tximport DESeq2 • 218 views

All count output of tximport is proportional to the original sequencing depth. That is, the effect of sequencing depth has not been removed. This is why the data handoffs (e.g. to DESeq2, edgeR, limma-voom) in the vignette make sense.

I prefer raw read count with offset, as it is the closest to the original error model, just correcting for biases via offset. The others typically perform identically in simulation (see Soneson 2015).

No, counts produced by tximport are not interpretable like TPM. I recommend using TPM directly: txi$abundance.
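To make the difference between counts and TPM concrete, here is a minimal sketch of the usual TPM formula, TPM_i = (c_i / l_i) / sum_j (c_j / l_j) * 1e6. The function name `counts_to_tpm` and the numbers are hypothetical, chosen only for illustration; in practice txi$abundance holds the quantifier's own estimates rather than values recomputed like this.

```python
def counts_to_tpm(counts, eff_lengths):
    """Convert one sample's counts to TPM, given per-feature effective lengths."""
    # Reads per base of effective length for each feature
    rates = [c / l for c, l in zip(counts, eff_lengths)]
    total = sum(rates)
    # Rescale so that the sample sums to one million
    return [r / total * 1e6 for r in rates]


# Hypothetical sample: three features with identical counts but different lengths
counts = [300, 300, 300]
eff_lengths = [500, 1000, 3000]

tpm = counts_to_tpm(counts, eff_lengths)

# Same sample sequenced twice as deeply: counts double, TPM should not change
tpm_deep = counts_to_tpm([2 * c for c in counts], eff_lengths)
```

The three features have the same count, yet the shortest one gets the highest TPM, and doubling the depth leaves the TPM values unchanged. Counts carry both the length and the depth effect, which is why they are not read like TPM.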

  1. Normalization of counts is unrelated to scaling to transcript length. The former is intended to account for differences in library size, whereas the latter is intended to adjust for differences in average transcript length. Both will affect the counts per gene, but library size is a sample-wide phenomenon: all things equal, gene counts from a library with 20M reads should be about twice those from a library with 10M reads. Length bias is a gene-wise phenomenon: if gene X has two transcripts, one of which is twice the length of the other, then for the same number of transcripts you expect somewhere around twice the reads for the longer transcript than for the shorter.
  2. Mike Love may be along to provide more input, but for me, I use 'regular' counts when modeling the counts using a GLM, and length scaled counts when I am using limma-voom.
  3. You could, I suppose. But there are other considerations. For example, GC content will affect the number of reads for a gene, so if you are using something like WGCNA, where the gene data are +/- assumed to be iid, I would tend to use length scaled data and then normalize using cqn to adjust for GC content and differences in gene lengths. But I can't come up with another use case where one might compare counts of genes within a sample.
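The library-size side of point 1 can be sketched with a median-of-ratios size factor calculation, the general idea behind the default size factors in DESeq2. This is a simplified illustration in plain Python, not the package's code; the function name `size_factors` and the toy matrix are assumptions.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of genes, each a list of per-sample counts.
    """
    n_samples = len(counts[0])
    # Keep only genes with a positive count in every sample (log(0) is undefined)
    log_rows = [[math.log(c) for c in row]
                for row in counts if all(c > 0 for c in row)]
    # Log of each gene's geometric mean across samples
    log_geo = [sum(row) / n_samples for row in log_rows]
    factors = []
    for j in range(n_samples):
        log_ratios = [row[j] - g for row, g in zip(log_rows, log_geo)]
        factors.append(math.exp(median(log_ratios)))
    return factors


# Hypothetical data: the second library is sequenced twice as deeply as the first
counts = [[100, 200], [50, 100], [400, 800], [10, 20]]
sf = size_factors(counts)
normalized = [[c / f for c, f in zip(row, sf)] for row in counts]
```

Here the second sample's size factor is twice the first, and dividing by the size factors makes the two columns agree. This step is separate from any length correction and is applied to tximport counts just as it is to ordinary counts.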
ATpoint ★ 3.4k

1) The scaling is done to correct for different transcript composition per gene and sample. If sample-1 expresses long isoforms for gene-A and sample-2 expresses shorter ones (at the same expression level), then sample-1 gets more counts for that gene, since the transcripts are longer. tximport corrects for the composition but it does not do any library size normalization. That is, if sample-1 has 10M reads and sample-2 has 50M, then you still have to correct for that. tximport counts are corrected for the transcript length composition but not for read depth.
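The gene-A example can be worked through numerically. The sketch below follows my understanding of what countsFromAbundance="lengthScaledTPM" does at the gene level: multiply the abundance by the gene's average length across samples, then rescale each column back to its original total count. The function name `length_scaled_tpm` and all numbers are hypothetical, and this is plain Python for illustration, not tximport's code.

```python
def length_scaled_tpm(counts, tpm, lengths):
    """Length-scaled counts from genes x samples nested lists.

    counts:  estimated counts
    tpm:     abundances (TPM)
    lengths: average transcript length per gene and sample
    """
    n_genes, n_samples = len(counts), len(counts[0])
    # One length per gene: the mean across samples
    mean_len = [sum(row) / n_samples for row in lengths]
    scaled = [[tpm[g][s] * mean_len[g] for s in range(n_samples)]
              for g in range(n_genes)]
    out = [[0.0] * n_samples for _ in range(n_genes)]
    for s in range(n_samples):
        depth = sum(counts[g][s] for g in range(n_genes))  # original column sum
        total = sum(scaled[g][s] for g in range(n_genes))
        for g in range(n_genes):
            # Rescale so each sample keeps its original total count
            out[g][s] = scaled[g][s] * depth / total
    return out


# Hypothetical data. Rows: gene-A, gene-B. Columns: sample-1, sample-2.
# Both genes have the same TPM in both samples; sample-2 has 5x the depth,
# and gene-A uses a long isoform in sample-1 and a short one in sample-2.
counts = [[2000, 7500], [1000, 7500]]
tpm = [[500000, 500000], [500000, 500000]]
lengths = [[2000, 1000], [1000, 1000]]

ls = length_scaled_tpm(counts, tpm, lengths)
```

With these numbers the raw counts for gene-A differ by a factor of 3.75 between samples, which mixes the isoform switch with the depth difference. The length-scaled counts come out at about 1800 and 9000, a factor of 5 that reflects depth alone. The column sums stay at 3000 and 15000, so library size still has to be handled downstream.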

2) I personally prefer lengthScaledTPM since they're generic. You can put them into any downstream analysis, also with tools that do not support offsets. As Mike says, results are very similar in my experience, no matter what you choose.

3) See 1): the correction deals with the length composition of the transcripts per gene. It does not do any gene length scaling like TPM, nor does it correct for depth. You need to do that with the raw counts as usual before downstream analysis.

