Question

tximport counts versus TPM

Ina Hoeschele ▴ 620

@ina-hoeschele-2992

United States

Hi, when I use tximport to summarize to the gene level, once with the default countsFromAbundance="no" (to get counts) and once with countsFromAbundance="lengthScaledTPM", the count matrices (second component of the output list) from the two runs are very similar but not identical. The library sizes (colSums) appear to be the same: they produce identical summary statistics, see below.

summary(colSums(counts))             # from countsFromAbundance default
#     Min.     1st Qu.     Median        Mean      3rd Qu.      Max. 
# 54,676,316 130,612,505 142,882,209 144,735,800 158,026,630 297,167,159

summary(colSums(TPM))  
#     Min.     1st Qu.     Median        Mean      3rd Qu.      Max. 
# 54,676,316 130,612,505 142,882,209 144,735,800 158,026,630 297,167,159

Can you please explain this, i.e. how is length-scaled TPM computed? I expect length-scaled TPM to be calculated as follows:

1. Divide the read counts by the length of each gene in kilobases to obtain reads per kilobase (RPK).
2. Sum all RPK values in a sample and divide this number by 1,000,000 to obtain the (per million) scaling factor for the sample.
3. Divide the RPK values by the scaling factor to obtain TPM.
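
For reference, the three steps above can be sketched in R as follows. This is a minimal illustration of the "usual" TPM only, not of anything tximport does; the toy matrix `counts` and the vector `gene_length_bp` are made-up values.

```r
# Toy data, made up for illustration: 3 genes x 2 samples
counts <- matrix(c(100, 200, 300,
                   150, 100, 450),
                 nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"),
                                 c("s1", "s2")))
# Hypothetical gene lengths in base pairs
gene_length_bp <- c(geneA = 1000, geneB = 2000, geneC = 3000)

# Step 1: reads per kilobase (the length vector is recycled down each column)
rpk <- counts / (gene_length_bp / 1000)

# Step 2: per-million scaling factor, one value per sample
scaling <- colSums(rpk) / 1e6

# Step 3: divide each column by its scaling factor to obtain TPM
tpm <- sweep(rpk, 2, scaling, "/")

colSums(tpm)  # each column sums to one million by construction
```

Under this definition every column sums to one million regardless of sequencing depth, so column sums on the order of 10^8 cannot come from this calculation alone.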

Based on this, I do not see why my TPM values are so similar to the (simple) counts. I had assumed that your length-scaled TPM is the "usual" TPM as computed above; is that not the case?

Thank you.

tximport TPM • 2.1k views

ADD COMMENT • link updated 3.8 years ago by Michael Love 41k • written 3.8 years ago by Ina Hoeschele ▴ 620

score 0 · Answer 1 · 2020-07-08

Michael Love 41k

@mikelove

United States

"how is length-scaled TPM computed?"

First, see either ?tximport or the 2015 publication for details on the computation. In particular, you will see why the column sums are the same.

ADD COMMENT • link 3.8 years ago Michael Love 41k

I take it back: the 2015 publication describes only scaledTPM, not lengthScaledTPM:

...summing the estimated transcript TPMs from Salmon within genes, and multiplying with the total library size in millions (scaledTPM).

But both are described in the countsFromAbundance section of ?tximport:

...to generate estimated counts using abundance estimates scaled up to library size (scaledTPM) or additionally scaled using the average transcript length over samples and the library size (lengthScaledTPM)

In lengthScaledTPM, the within-gene summed TPM is first multiplied by the gene's average transcript length, averaged over samples, and the result is then scaled up so that each column sums to the per-sample library size. This is why the column sums match those of the original counts. The values should be closer to the original counts than scaledTPM values are, but they have been corrected for changes in average transcript length across samples.
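
The description above can be sketched in R roughly as follows. This is an illustration written from the prose description, not tximport's source code: the function name `length_scaled_tpm` is hypothetical, the toy matrices are made up, and using the arithmetic mean over samples for the length is an assumption.

```r
# Toy gene-level inputs, made up for illustration (3 genes x 2 samples)
dn <- list(c("geneA", "geneB", "geneC"), c("s1", "s2"))
abundance     <- matrix(c(2e5, 3e5, 5e5,  4e5, 1e5, 5e5), nrow = 3, dimnames = dn)  # summed TPM
counts        <- matrix(c(100, 200, 300,  150, 100, 450), nrow = 3, dimnames = dn)  # estimated counts
avg_tx_length <- matrix(c(1000, 2000, 3000,  1200, 1800, 3000), nrow = 3, dimnames = dn)

# Hypothetical helper following the description above
length_scaled_tpm <- function(abundance, counts, avg_tx_length) {
  lib_size    <- colSums(counts)         # per-sample library size
  mean_length <- rowMeans(avg_tx_length) # per-gene length, averaged over samples (assumed arithmetic mean)
  scaled      <- abundance * mean_length # TPM multiplied by the gene-level length
  # Rescale each column so that it sums to that sample's library size
  sweep(scaled, 2, lib_size / colSums(scaled), "*")
}

ls_counts <- length_scaled_tpm(abundance, counts, avg_tx_length)

colSums(ls_counts)  # matches colSums(counts) by construction
colSums(counts)
```

Because the last step rescales every column to the original library size, the column sums agree with those of the counts exactly, while individual gene values differ wherever a sample's transcript length departs from the across-sample average.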

ADD REPLY • link 3.8 years ago Michael Love 41k