Question: dtuScaledTPM vs lengthScaledTPM in DTU analysis
6 months ago by
fiona.dick9110
fiona.dick9110 wrote:

Hi,

I have a question concerning the "new" scaling method offered by tximport called dtuScaledTPM and how it affects DTU analysis. So far I have used scaledTPM, which as I understand it (correct me if I'm wrong) scales the TPM values to library size by multiplying the TPM of a transcript in a sample by the column sum of the count matrix, and thus brings them back onto the count scale. dtuScaledTPM additionally incorporates transcript length into the library size information (dividing the count-based library size by the library size calculated from TPM * transcript length). And the transcript length is the median of the transcript lengths of all transcripts in a gene (where the transcript length itself is the average across all samples). Is this correct? And if so, why is this beneficial for DTU analysis? I understand that using lengthScaledTPM is not advantageous, but I can't wrap my head around why this method is better "just" because the transcript length value is the median instead of the mean.

Sorry if this is a rather confusing question. I'm seeking to understand why this method should be used for DTU. It would be great to get a worst-case toy example where lengthScaledTPM would not work, and also one where the lack of length scaling in scaledTPM would not work well.

Fiona

Answer: dtuScaledTPM vs lengthScaledTPM in DTU analysis
6 months ago by
Michael Love25k
United States
Michael Love25k wrote:

I understand that this is tricky. It took us a while to work this out, and then a few iterations to explain it in the paper as well. OK, so if you understand the problem with lengthScaledTPM and why we can instead use scaledTPM, then we're already quite far along. (For others reading this thread, see the rnaseqDTU workflow or paper, and the section that starts with "For DTU analysis, we suggest generating counts from abundance, using the scaledTPM method...".)

The idea behind dtuScaledTPM is that the scaledTPM values do not reflect the fact that, in general, long transcripts will have higher counts, and so higher precision. We may want to get a bit closer to the original count scale for each transcript, while still avoiding the problem with original counts and lengthScaledTPM. So we can, within each gene, multiply all the TPM estimates by a single value: the median, over the isoforms of that gene, of the mean effective transcript length over samples. Since all TPMs for all samples within a gene are scaled by the same value, we avoid the issue discussed in the workflow that occurs with original counts or lengthScaledTPM.
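To make the three scalings concrete, here is a minimal NumPy sketch of how I read them from the description above. It is not tximport's code: the function name `counts_from_abundance`, the helper `isoform_usage`, and all toy numbers are made up for illustration, and effective lengths are assumed identical across samples to keep it short.

```python
import numpy as np

def counts_from_abundance(tpm, counts, eff_len, gene_ids, method):
    """Rescale TPM to the count scale; rows are transcripts, columns are samples."""
    tpm = np.asarray(tpm, dtype=float)
    lib_size = np.asarray(counts, dtype=float).sum(axis=0)    # original column sums
    mean_len = np.asarray(eff_len, dtype=float).mean(axis=1)  # mean length over samples
    gene_ids = np.asarray(gene_ids)

    if method == "scaledTPM":
        factor = np.ones_like(mean_len)            # no length information
    elif method == "lengthScaledTPM":
        factor = mean_len                          # one factor per transcript
    elif method == "dtuScaledTPM":
        factor = np.empty_like(mean_len)           # one factor per gene
        for g in np.unique(gene_ids):
            in_gene = gene_ids == g
            factor[in_gene] = np.median(mean_len[in_gene])
    else:
        raise ValueError(f"unknown method: {method}")

    scaled = tpm * factor[:, None]
    # Bring every column back to the original library size.
    return scaled * (lib_size / scaled.sum(axis=0))

def isoform_usage(mat, in_gene):
    """Within-gene proportions per sample."""
    sub = mat[in_gene]
    return sub / sub.sum(axis=0)

# Hypothetical toy data: geneA has a short and a long isoform, geneB has one isoform.
gene_ids = np.array(["geneA", "geneA", "geneB"])
tpm = np.array([[300_000., 100_000.],
                [100_000., 300_000.],
                [600_000., 600_000.]])
eff_len = np.array([[1000., 1000.],
                    [3000., 3000.],
                    [ 500.,  500.]])
counts = np.array([[300., 100.],
                   [300., 900.],
                   [300., 300.]])

methods = ["scaledTPM", "lengthScaledTPM", "dtuScaledTPM"]
results = {m: counts_from_abundance(tpm, counts, eff_len, gene_ids, m) for m in methods}
in_a = gene_ids == "geneA"
for m in methods:
    print(m, isoform_usage(results[m], in_a).round(3).tolist())
```

In this toy setup, scaledTPM and dtuScaledTPM both give geneA proportions equal to the TPM proportions (0.75/0.25 in the first sample), because every isoform of the gene shares one factor, whereas lengthScaledTPM does not. dtuScaledTPM differs from scaledTPM only in how many counts geneA receives relative to geneB.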

Thanks for taking the time to answer. So the main difference from lengthScaledTPM is that you take the average over all samples:

"Since all TPMs for all samples within a gene are scaled by the same value"

and so you do not unintentionally scale down the whole gene's expression in one sample (group A) and not the other (group B). Hence you do not divide by a different total gene expression amount when calculating the proportions (as opposed to what happens when using lengthScaledTPM, which is what you want for DGE).
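A toy case for the point above, as a rough sketch under made-up assumptions: one hypothetical gene with a short and a long isoform switches usage between two samples while its total TPM stays the same, and both samples are assumed to have the same library size. All numbers and names are illustrative only.

```python
import numpy as np

# Rows: short isoform, long isoform of one hypothetical gene, then one
# background transcript. Columns: a group A sample and a group B sample.
tpm = np.array([[ 80_000., 320_000.],
                [320_000.,  80_000.],
                [600_000., 600_000.]])
mean_len = np.array([500., 2000., 1000.])   # assumed mean effective lengths
lib_size = np.array([1e6, 1e6])             # assumed equal library sizes

def to_count_scale(weighted):
    """Rescale each column so that it sums to the library size."""
    return weighted * (lib_size / weighted.sum(axis=0))

scaled = to_count_scale(tpm)                              # scaledTPM-like
length_scaled = to_count_scale(tpm * mean_len[:, None])   # lengthScaledTPM-like

# Total for the gene (first two rows) in each sample.
gene_total_scaled = scaled[:2].sum(axis=0)
gene_total_length_scaled = length_scaled[:2].sum(axis=0)

# Share of the short isoform within the gene in each sample.
short_share_scaled = scaled[0] / gene_total_scaled
short_share_length_scaled = length_scaled[0] / gene_total_length_scaled

print(gene_total_scaled, gene_total_length_scaled)
print(short_share_scaled, short_share_length_scaled)
```

With the per-transcript length factor, the gene total is larger in the sample that favors the long isoform even though the gene's TPM total is identical in both, and the short isoform's share no longer matches its share of the abundance. Without the length factor, both stay as in the TPMs.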

And the main argument for including transcript length scaling (in DTU) basically boils down to avoiding confounding of the low dispersion (or high precision) of long transcripts, which arises "just" because they generally have higher counts (due to their length).

Is that right?

With dtuScaledTPM, as compared to scaledTPM, we want long transcripts to have counts in a range similar to their original counts. Actually, the precision has more to do with the Poisson component than the dispersion component (if we are considering a Gamma-Poisson model for the data).
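One way to see the Poisson-versus-dispersion point, assuming a Gamma-Poisson (negative binomial) count K with mean μ and dispersion α (symbols chosen here for illustration, they are not from the thread):

```latex
\operatorname{Var}(K) = \mu + \alpha \mu^{2}
\quad\Longrightarrow\quad
\mathrm{CV}^{2}(K) = \frac{\operatorname{Var}(K)}{\mu^{2}} = \frac{1}{\mu} + \alpha
```

The 1/μ term is the Poisson component, and it shrinks as the count scale grows, while α does not depend on the scale. So putting a transcript on a larger or smaller count scale changes the implied precision mainly through the Poisson term.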