Hi, I use tximport to summarize to the gene level, running it once with the default countsFromAbundance setting (to get counts) and once with countsFromAbundance="lengthScaledTPM". The count matrices (second component of the output list) from the two runs are very similar but not identical, and the library sizes (colSums) appear to be the same: they produce identical summary statistics, see below.
summary(colSums(counts)) # from countsFromAbundance default
#       Min.     1st Qu.      Median        Mean     3rd Qu.        Max.
# 54,676,316 130,612,505 142,882,209 144,735,800 158,026,630 297,167,159
summary(colSums(TPM))
#       Min.     1st Qu.      Median        Mean     3rd Qu.        Max.
# 54,676,316 130,612,505 142,882,209 144,735,800 158,026,630 297,167,159
Can you please explain this, i.e. how is length-scaled TPM computed? I expect length-scaled TPM to be calculated as follows:
- Divide the read counts by the length of each gene in kilobases to obtain reads per kilobase (RPK).
- Sum all RPK values in a sample and divide this number by 1,000,000 to obtain the (per million) scaling factor for the sample.
- Divide the RPK values by the scaling factor to obtain TPM.
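
To make these steps concrete, here is a minimal sketch of the calculation I have in mind. The numbers are made-up toy values, and the names counts_toy and gene_length_bp are hypothetical (they are not from my data or from tximport):

```r
# Hypothetical toy data: 3 genes x 2 samples (values are invented for illustration)
counts_toy <- matrix(c(100, 200, 300,
                        50, 400, 150),
                     nrow = 3,
                     dimnames = list(c("geneA", "geneB", "geneC"),
                                     c("s1", "s2")))

# Assumed gene lengths in base pairs, one per row of counts_toy
gene_length_bp <- c(geneA = 1000, geneB = 2000, geneC = 500)

# Step 1: reads per kilobase (the length vector is recycled down each column)
rpk <- counts_toy / (gene_length_bp / 1000)

# Step 2: per-million scaling factor for each sample
scaling <- colSums(rpk) / 1e6

# Step 3: divide each column by its scaling factor to obtain TPM
tpm <- sweep(rpk, 2, scaling, "/")

colSums(tpm)     # TPM column sums under this definition
colSums(counts_toy)  # raw library sizes, for comparison
```

With this definition every sample's TPM values sum to 1,000,000 regardless of the raw library size, which is why I did not expect column sums that match those of the plain counts.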
Based on this, I do not see why my TPM values are so similar to the (simple) counts. Am I right to assume that your length-scaled TPM is the "usual" TPM as computed above?