tximport counts versus TPM
Ina Hoeschele
@ina-hoeschele-2992
United States

Hi, when I use tximport to summarize to the gene level, once with the default countsFromAbundance (to get counts) and once with countsFromAbundance="lengthScaledTPM", the count matrices (the second component of the output list) from the two runs are very similar but not identical. The library sizes (colSums) appear to be the same, since they produce identical summary statistics; see below.

summary(colSums(counts))             # from countsFromAbundance default
#     Min.     1st Qu.     Median        Mean      3rd Qu.      Max.
# 54,676,316 130,612,505 142,882,209 144,735,800 158,026,630 297,167,159

summary(colSums(TPM))                # from countsFromAbundance="lengthScaledTPM"
#     Min.     1st Qu.     Median        Mean      3rd Qu.      Max.
# 54,676,316 130,612,505 142,882,209 144,735,800 158,026,630 297,167,159


Can you please explain this, i.e. how is length-scaled TPM computed? I expect length-scaled TPM to be calculated as follows:

1. Divide the read counts by the length of each gene in kilobases to obtain reads per kilobase (RPK).
2. Sum all RPK values in a sample and divide this number by 1,000,000 to obtain the (per million) scaling factor for the sample.
3. Divide the RPK values by the scaling factor to obtain TPM.
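
The three steps above can be sketched in R as follows. This is only an illustration of the "usual" TPM calculation on made-up toy data; the matrix `counts` and the vector `gene_length_bp` are hypothetical and not taken from any tximport output.

```r
# Toy data: 3 genes x 2 samples (made-up values, for illustration only)
counts <- matrix(c(100, 200, 700,
                   300, 300, 400),
                 nrow = 3,
                 dimnames = list(c("g1", "g2", "g3"), c("s1", "s2")))
gene_length_bp <- c(g1 = 1000, g2 = 2000, g3 = 500)

# Step 1: reads per kilobase (each row is divided by its gene length in kb)
rpk <- counts / (gene_length_bp / 1000)

# Step 2: per-sample "per million" scaling factor
scaling <- colSums(rpk) / 1e6

# Step 3: divide each column of RPK by its sample's scaling factor
tpm <- sweep(rpk, 2, scaling, "/")

colSums(tpm)
```

By construction, every column of `tpm` sums to one million, whatever the sequencing depth of the sample.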

Based on this, I do not see why my TPM values are so similar to the (simple) counts. I had assumed that your length-scaled TPM is the "usual" TPM as computed above.

Thank you.

@mikelove
United States

"how is length-scaled TPM computed?"

First, see either ?tximport or the 2015 publication for details on the computation. In particular, you will see why the column sums are the same.


I take it back: only scaledTPM is described in the 2015 publication, not lengthScaledTPM:

"...summing the estimated transcript TPMs from Salmon within genes, and multiplying with the total library size in millions (scaledTPM)."

But both are described in the countsFromAbundance section of ?tximport:

"...to generate estimated counts using abundance estimates scaled up to library size (scaledTPM) or additionally scaled using the average transcript length over samples and the library size (lengthScaledTPM)"

In lengthScaledTPM, the within-gene summed TPM is first multiplied by the gene's average transcript length, averaged over samples, and then scaled up so that each column sums to that sample's library size. The result should be closer to the original counts than scaledTPM is, while still being corrected for changes in average transcript length across samples.
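
A minimal R sketch of that description, on made-up toy matrices, may help show why the column sums match. It follows the wording above and is not presented as tximport's actual code; the helper `rescale_to_lib_size` and the objects `counts`, `len`, and `abundance` are hypothetical stand-ins for the gene-level counts, average transcript lengths, and TPM abundances.

```r
# Toy gene-level matrices (made-up values): 3 genes x 2 samples
genes <- c("g1", "g2", "g3")
samples <- c("s1", "s2")
counts <- matrix(c(100, 200, 700, 300, 300, 400), nrow = 3,
                 dimnames = list(genes, samples))
# Average transcript length per gene, allowed to differ between samples
len <- matrix(c(1000, 2000, 500, 1200, 1800, 500), nrow = 3,
              dimnames = list(genes, samples))

# TPM-like abundance derived from the toy counts and lengths
rate <- counts / len
abundance <- t(t(rate) / colSums(rate)) * 1e6

# Per-sample library size taken from the original estimated counts
lib_size <- colSums(counts)

# Hypothetical helper: rescale each column of x to sum to lib_size
rescale_to_lib_size <- function(x, lib_size) {
  t(t(x) * (lib_size / colSums(x)))
}

# scaledTPM: abundance scaled up to library size
scaled_tpm <- rescale_to_lib_size(abundance, lib_size)

# lengthScaledTPM: abundance times the per-gene length averaged over samples,
# then scaled up to library size
length_scaled_tpm <- rescale_to_lib_size(abundance * rowMeans(len), lib_size)

colSums(counts)
colSums(scaled_tpm)
colSums(length_scaled_tpm)
```

Because the last step rescales every column to the library size of the original counts, all three sets of column sums agree, while the individual gene-level values differ slightly from the original counts wherever a gene's length varies across samples.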