Hi, when I use tximport to summarize to the gene level, once with the default countsFromAbundance (to get counts) and once with countsFromAbundance="lengthScaledTPM", the count matrices (the second component of the output list) from the two runs are very similar but not identical. The library sizes (colSums) appear to be the same, since they produce identical summary statistics, see below.
summary(colSums(counts)) # from countsFromAbundance default
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 54,676,316 130,612,505 142,882,209 144,735,800 158,026,630 297,167,159
summary(colSums(TPM))
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 54,676,316 130,612,505 142,882,209 144,735,800 158,026,630 297,167,159
Can you please explain this, i.e. how is length-scaled TPM computed? I expect length-scaled TPM to be calculated as follows:
- Divide the read counts by the length of each gene in kilobases to obtain reads per kilobase (RPK).
- Sum all RPK values in a sample and divide this number by 1,000,000 to obtain the (per million) scaling factor for the sample.
- Divide the RPK values by the scaling factor to obtain TPM.
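The three steps above can be sketched in R as follows. This is only an illustration on made-up toy values (the matrix `counts` and the vector `gene_length_bp` are hypothetical, not taken from my data), and it is not how tximport computes anything:

```r
# Toy data: 3 genes x 2 samples (made-up values, for illustration only)
counts <- matrix(c(100, 200, 300,
                   150, 250, 600),
                 nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"), c("s1", "s2")))
gene_length_bp <- c(geneA = 1000, geneB = 2000, geneC = 500)

# Step 1: reads per kilobase (RPK); the length vector is recycled down the rows
rpk <- counts / (gene_length_bp / 1000)

# Step 2: per-sample scaling factor (sum of RPK divided by 1e6)
scaling <- colSums(rpk) / 1e6

# Step 3: divide each sample's RPK values by its scaling factor to get TPM
tpm <- sweep(rpk, 2, scaling, "/")

# Every column of a TPM matrix computed this way sums to one million
colSums(tpm)
```

With TPM computed like this, every sample sums to 1e6, which is clearly not what colSums shows for my lengthScaledTPM matrix.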
Based on this, I do not see why my TPM values are so similar to the (simple) counts. I had assumed that your length-scaled TPM is the "usual" TPM as computed above; is that not the case?
Thank you.
I take it back: only scaledTPM is described in the 2015 publication, not lengthScaledTPM.
But both are described in the countsFromAbundance section of ?tximport:
In lengthScaledTPM, the within-gene summed TPM is first multiplied by the average transcript length, averaged over samples, then scaled up to the per-sample library size. This should be closer to the original count compared to scaledTPM, but it has been corrected for changes in average transcript length across samples.
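To make the quoted description concrete, here is a minimal sketch in R of that calculation on made-up toy values. The function name `length_scaled_tpm` and all inputs are hypothetical; this follows the help-page wording and is not a copy of the package's internal code:

```r
# Toy gene-level inputs (made-up values): 3 genes x 2 samples
dn <- list(c("geneA", "geneB", "geneC"), c("s1", "s2"))
counts    <- matrix(c(100, 200, 300, 150, 250, 600), nrow = 3, dimnames = dn)
abundance <- matrix(c(2e5, 3e5, 5e5, 1e5, 2e5, 7e5), nrow = 3, dimnames = dn)  # TPM
len       <- matrix(c(1000, 2000, 500, 1200, 1800, 500), nrow = 3, dimnames = dn)  # avg. transcript length

# Hypothetical helper that follows the description in ?tximport
length_scaled_tpm <- function(counts, abundance, len) {
  lib_size <- colSums(counts)            # original per-sample library size
  scaled   <- abundance * rowMeans(len)  # TPM times gene length averaged over samples
  # Rescale each column so that it sums to the original library size
  sweep(scaled, 2, lib_size / colSums(scaled), "*")
}

new_counts <- length_scaled_tpm(counts, abundance, len)

# The column sums match the original counts by construction
colSums(new_counts)
colSums(counts)
```

Because the last step rescales every sample to its original library size, the colSums of the lengthScaledTPM matrix equal those of the default counts, even though individual entries differ. That would explain the identical summary statistics reported above.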