Question

Does countsFromAbundance="lengthScaledTPM" produce un-normalized counts?

Christian • 0

Hi everyone,

The DESeq2 documentation states that the input needs to be un-normalized counts (= raw counts?), while the tximport documentation suggests, for Salmon data, applying countsFromAbundance="lengthScaledTPM" and using the result as a regular count matrix.

1) To my understanding, tximport implies that its length-scaled counts can be treated like un-normalized counts in DESeq2. But why? Is it because these bias-corrected counts from tximport are different from normalized or transformed counts?
2) Are there any reasons to prefer importing raw read counts (with offset) over countsFromAbundance="lengthScaledTPM", or can they be used interchangeably?
3) As the outcome of countsFromAbundance="lengthScaledTPM" is corrected to a similar scale as TPMs: could I use the data likewise, e.g., to compare counts between different genes within the same sample? And after transformation via rlog/vst, would the values be on a scale that could be compared between genes AND between samples? Basically, would TPM not be required anymore?

I'm grateful for any clarification.

tximport DESeq2 • 1.3k views

ADD COMMENT • link updated 7 months ago by Michael Love 41k • written 8 months ago by Christian • 0

score 1 · Answer 1 · 2023-08-29

Michael Love 41k

All output of tximport is proportional to original sequencing depth. That is, sequence depth effect has not been removed. This is why the data handoffs (e.g. to DESeq2, edgeR, limma-voom) in the vignette make sense.
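
A minimal Python sketch of this point, not tximport's own code: it assumes that the count-scale output is obtained by rescaling each sample's abundances so that they sum to that sample's original total of estimated counts (this is how the "scaledTPM" option is generally described). The function name `scaled_tpm` and the toy numbers are made up for illustration.

```python
def scaled_tpm(tpm, library_size):
    """Rescale one sample's TPM values so they sum to its library size."""
    total = sum(tpm)
    return [x * library_size / total for x in tpm]

# Hypothetical toy data: identical relative abundances in both samples,
# but one library has 10M mapped reads and the other 20M.
tpm = [500_000.0, 300_000.0, 200_000.0]
shallow = scaled_tpm(tpm, 10_000_000)
deep = scaled_tpm(tpm, 20_000_000)

# Column sums still equal the original library sizes ...
print(sum(shallow), sum(deep))
# ... so every gene is about twice as high in the deeper sample.
print([d / s for d, s in zip(deep, shallow)])
```

The depth difference survives the rescaling, which is why the downstream package (DESeq2, edgeR, limma-voom) still has to estimate size factors or library sizes itself.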

I prefer raw read count with offset, as it is the closest to the original error model, just correcting for biases via offset. The others typically perform identically in simulation (see Soneson 2015).

No, the counts produced by tximport are not interpretable like TPM. I recommend using TPM directly: txi$abundance.

ADD COMMENT • link 8 months ago Michael Love 41k

Thank you for the response (and to the other responders as well) and apologies for my late reply. Especially the reference to Soneson is very helpful. I think I was simply confused by isoform length scaled vs. gene length normalized. If I understand correctly the scaling to counts really refers to actual upscaling of the TPM values to the level of counts via library size (ScaledTPM) and to the level of counts + considering different isoform lengths (lengthScaledTPM).

However, do the 'scaledTPM' counts at least inherit the normalisation by gene length from TPM and have this as an advantage over the original raw counts? And does 'lengthScaledTPM' then add another layer to data already normalized by gene length, by considering different isoform lengths between samples? Otherwise, what would be the advantage of scaledTPM over the original raw counts (without offset)?

I'm grateful for further clarification.

Kind regards, Christian

ADD REPLY • link 7 months ago Christian • 0

Advantages and disadvantages depend on the aim, i.e. what you plan to do with the values (testing, plots, etc.).

By the way, lengthScaledTPM puts the gene length back in. It just does so in a way that differential transcript usage (DTU) won't bias a differential gene expression (DGE) analysis.

scaledTPM have the advantage over raw counts that the latter are subject to this bias (DTU can produce spurious, "apparent" DGE).
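
A minimal Python sketch of the difference between the two options, under the assumption that "lengthScaledTPM" multiplies each gene's TPM by its average transcript length (averaged over samples) before rescaling to the library size, while "scaledTPM" only rescales. The function `counts_from_abundance` and the toy numbers are hypothetical; this is not tximport's API.

```python
def counts_from_abundance(tpm, mean_length, library_size, length_scaled):
    """Turn one sample's per-gene TPM into count-scale values.

    mean_length holds each gene's average transcript length, averaged
    over samples (assumed to be what the length-scaled variant uses).
    """
    if length_scaled:
        raw = [t * l for t, l in zip(tpm, mean_length)]
    else:
        raw = list(tpm)
    total = sum(raw)
    return [x * library_size / total for x in raw]

# Two genes with equal TPM; gene B is three times as long as gene A.
tpm = [500_000.0, 500_000.0]
mean_length = [1_000.0, 3_000.0]

counts_scaled = counts_from_abundance(tpm, mean_length, 1_000_000, False)
counts_length_scaled = counts_from_abundance(tpm, mean_length, 1_000_000, True)

print(counts_scaled)         # equal values: gene length is not reflected
print(counts_length_scaled)  # longer gene gets more, as with raw read counts
```

Because the same across-sample average length is used for every sample, the length factor cancels out when a gene is compared between samples, which is the sense in which gene length is "put back in" without letting isoform switches leak into the gene-level comparison.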

ADD REPLY • link 7 months ago Michael Love 41k

Many thanks for the clarification.

Then only scaledTPM inherits the normalisation by gene length from TPM, while in lengthScaledTPM the feature length is added back again and we are on a similar level as in raw counts. However, both methods account for DTU between samples.

Would this also mean that if someone tried to calculate TPM values according to the standard formula (as used for raw counts), this would be valid using lengthScaledTPM counts, but not for scaledTPM?

ADD REPLY • link 7 months ago Christian • 0

I do not follow why you would calculate TPM from something derived from TPM.

ADD REPLY • link 7 months ago Michael Love 41k

I get your point, as we are already starting with TPM. This was more to check that I understand the principle correctly.

However, I'm also checking out a differential expression pipeline that uses DESeq2 but currently only accepts gene count matrices as input (no support for the offset), and I would like to clarify whether 'lengthScaledTPM', 'scaledTPM' or 'original raw counts' count matrices would be a valid choice. I did not check whether this actually happens, but if this pipeline recalculated TPMs based on 'lengthScaledTPM' (or 'scaledTPM' or 'original raw counts'), would this still provide reasonable results?

ADD REPLY • link 7 months ago Christian • 0

For the reasons outlined in the Soneson paper, I'd prefer this pipeline:

https://nf-co.re/rnaseq

You cannot use these scaled TPM approaches with gene count tables that don't involve transcript abundance estimation.

ADD REPLY • link 7 months ago Michael Love 41k

I'm sorry for causing confusion here, but the gene matrices I refer to are actually obtained via https://nf-co.re/rnaseq.

The problem with the 'rnaseq' pipeline is that it produces count matrices with the different methods described above, but does not perform further downstream processing (i.e., differential gene expression), except maybe PCA. For this downstream processing there is another pipeline (https://nf-co.re/differentialabundance/), which can perform treatment comparisons and other downstream tasks based on gene count matrices. So, while this downstream pipeline cannot make use of the 'offset' for raw counts via DESeq2 (as this is not implemented yet), it can apply standard DESeq2 processing with the lengthScaledTPM/scaledTPM count matrices as input. At least this is what I understand from the tximport documentation, where it is written:

The second method is to use the tximport argument countsFromAbundance="lengthScaledTPM" or "scaledTPM", and then to use the gene-level count matrix txi$counts directly as you would a regular count matrix

Does this sound reasonable, or have I misunderstood the concept here? I'm just searching for the best way to proceed, and I was not sure whether TPMs can be recalculated, as the pipeline uses count matrices as input without the TPMs from salmon.

ADD REPLY • link 7 months ago Christian • 0

Oh, I see.

If you can't use counts + offset, then I prefer passing lengthScaledTPM counts to DESeq2, i.e. txi$counts to DESeqDataSetFromMatrix().

So you can use the first pipeline to produce those counts and, it sounds like, analyze them with R or this other pipeline.

ADD REPLY • link 7 months ago Michael Love 41k

score 1 · Answer 2 · 2023-08-29

Normalization of counts is unrelated to scaling to transcript length. The former is intended to account for differences in library size, whereas the latter is intended to adjust for differences in average transcript length. Both will affect the counts per gene, but library size is a sample-wide phenomenon (all things equal, gene counts from a library with 20M reads should be about twice those from a library with 10M reads), whereas length bias is a gene-wise phenomenon. For example, assume gene X has two transcripts, one of which is twice the length of the other; all things equal (same number of transcripts), you expect around twice as many reads from the longer transcript as from the shorter.
Mike Love may be along to provide more input, but for me, I use 'regular' counts when modeling the counts using a GLM, and length scaled counts when I am using limma-voom.
You could, I suppose. But there are other considerations. For example, GC content will affect the number of reads for a gene, so if you are using something like WGCNA, where the gene data are more or less assumed to be iid, I would tend to use length-scaled data and then normalize using cqn to adjust for GC content and differences in gene lengths. But I can't come up with another use case where one might compare counts of genes within a sample.

score 1 · Answer 3 · 2023-08-30

1) The scaling is done to correct for different transcript composition per gene and sample. If sample-1 expresses long isoforms of gene-A and sample-2 expresses shorter ones (at the same expression level), then sample-1 gets more counts for that gene, since the transcripts are longer. tximport corrects for the composition but it does not do any library size normalization. That is, if sample-1 has 10M reads and sample-2 has 50M, you still have to correct for that. tximport counts are corrected for the transcript length composition but not for read depth.

2) I personally prefer lengthScaledTPM since they're generic. You can put them into any downstream analysis, also with tools that do not support offsets. As Mike says, results are very similar in my experience, no matter what you choose.

3) See 1): the correction deals with the length composition of transcripts per gene; it does not do any gene-length scaling like TPM, nor does it correct for depth. You need to handle depth, as you would with raw counts, before downstream analysis.
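
The isoform example from 1) can be worked through with a small Python sketch. Everything here is a made-up toy: the transcript names, lengths and molecule numbers are hypothetical, TPM is taken from the true molecule fractions (an idealised quantifier), and the length-scaled counts are assumed to be TPM times the across-sample mean gene length, rescaled to each sample's read total.

```python
# Hypothetical toy data: gene A switches isoform between samples at the
# same expression level; gene B is unchanged.
lengths = {"A_long": 2000, "A_short": 1000, "B": 1000}
gene_of = {"A_long": "A", "A_short": "A", "B": "B"}
molecules = {
    "sample1": {"A_long": 100, "A_short": 0, "B": 100},
    "sample2": {"A_long": 0, "A_short": 100, "B": 100},
}
genes = ["A", "B"]

def summarize(sample):
    """Return per-gene raw counts, TPM and abundance-weighted mean length."""
    mol = molecules[sample]
    # Expected reads are proportional to molecules x length.
    reads = {t: mol[t] * lengths[t] / 100 for t in mol}
    total_mol = sum(mol.values())
    counts, tpm, avg_len = {}, {}, {}
    for g in genes:
        txs = [t for t in mol if gene_of[t] == g]
        gene_mol = sum(mol[t] for t in txs)
        counts[g] = sum(reads[t] for t in txs)
        tpm[g] = 1e6 * gene_mol / total_mol
        avg_len[g] = sum(mol[t] * lengths[t] for t in txs) / gene_mol
    return counts, tpm, avg_len

summaries = {s: summarize(s) for s in molecules}
# One length per gene, averaged over samples.
mean_len = {g: sum(summaries[s][2][g] for s in summaries) / len(summaries)
            for g in genes}

def length_scaled(sample):
    """TPM x mean length, rescaled to the sample's original read total."""
    counts, tpm, _ = summaries[sample]
    raw = {g: tpm[g] * mean_len[g] for g in genes}
    factor = sum(counts.values()) / sum(raw.values())
    return {g: raw[g] * factor for g in genes}

for s in molecules:
    raw_counts = summaries[s][0]
    ls = length_scaled(s)
    print(s, "raw A/B:", raw_counts["A"] / raw_counts["B"],
          "length-scaled A/B:", ls["A"] / ls["B"],
          "column total:", sum(ls.values()))
```

In this toy case the raw A/B ratio differs between the samples although gene A is expressed at the same level in both, while the length-scaled ratio is the same in both. The column totals still differ, so read depth remains to be handled downstream.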