What lengths are used to calculate TPM in the "scater" package?
Hi,

I'm trying to calculate TPM from raw counts for bulk RNA-seq data. Does anyone know how transcript lengths are retrieved when using the function "calculateTPM"? Or, if the lengths have to be provided, how do I calculate them? ref: https://www.rdocumentation.org/packages/scater/versions/1.0.4/topics/calculateTPM

In the help file (?calculateTPM) you'll see an argument:

lengths: Numeric vector providing the effective length for each
feature in ‘x’. Alternatively ‘NULL’, see Details.
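
To make it concrete why a length per feature is needed, here is a minimal sketch of the usual TPM arithmetic for a single sample: divide each count by the feature length, then rescale so the values sum to one million. It is written in plain Python only to show the arithmetic; the function name `tpm` and the toy numbers are made up for illustration, and I am assuming (not asserting) that calculateTPM follows this standard definition when lengths are supplied.

```python
def tpm(counts, lengths):
    """Convert raw counts for one sample to TPM.

    counts  -- raw read counts, one per feature (hypothetical toy input)
    lengths -- effective length of each feature in bases, same order as counts
    """
    if len(counts) != len(lengths):
        raise ValueError("counts and lengths must have the same number of features")
    if any(length <= 0 for length in lengths):
        raise ValueError("feature lengths must be positive")

    # Step 1: length-normalise, giving reads per base for each feature.
    rates = [count / length for count, length in zip(counts, lengths)]

    # Step 2: rescale so the length-normalised values sum to one million.
    total = sum(rates)
    if total == 0:
        return [0.0] * len(rates)
    return [rate / total * 1e6 for rate in rates]


# Toy example: three genes, made-up counts and lengths.
counts = [100, 200, 300]
lengths = [1000, 2000, 1000]
values = tpm(counts, lengths)
```

Note that the first two genes end up with the same TPM even though one has twice the count, because it is also twice as long. That is exactly the effect the lengths argument controls.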


As for how to retrieve lengths: use information from a gene annotation source, e.g. Ensembl, biomaRt, or a TxDb.

Ideally, you would have a transcript length per gene, per sample, something like what RSEM outputs. But for single-cell data, that won't be possible with most kinds of sequencing protocols that sequence only the 5' end or the 3' end.

What Alan and Dario said. You will have to get your own lengths; there's no way for scater to know what annotation (or what version of it) you're using. There is a worked example here of using AnnotationHub resources to get the exonic lengths via AH73905, which - IIRC - is Ensembl GRCm38 version 97.
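
In case "exonic length" is unclear: one common convention is the number of bases covered by the union of a gene's exons, so that exons shared between transcripts or overlapping each other are counted once. A rough sketch of that calculation follows, in plain Python just to show the logic. The function name `exonic_length` and the coordinates are invented for illustration; with a real annotation the exon coordinates would come from the annotation resource, and other length conventions (e.g. the longest transcript) exist too.

```python
def exonic_length(exons):
    """Number of bases covered by the union of a gene's exons.

    exons -- list of (start, end) pairs, 1-based and inclusive
             (hypothetical toy input, not real annotation)
    """
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(exons):
        if cur_end is None or start > cur_end + 1:
            # Gap before this exon: close the previous merged block, start a new one.
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlapping or directly adjacent exon: extend the current block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


# Toy gene: two overlapping exons plus one separate exon (made-up coordinates).
toy_exons = [(100, 199), (150, 249), (400, 449)]
gene_length = exonic_length(toy_exons)
```

Here the first two exons merge into one 150-base block and the third adds 50 bases, so the gene length is 200 rather than the 250 you would get by naively summing exon widths.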