For GSVA scoring on RNA-seq data, the authors recommend using 'counts' as input (with kcdf="Poisson"), but also briefly mention the option of using logCPM, logTPM or logRPKM (with kcdf="Gaussian"). Since the first step in the GSVA scoring algorithm is to rank the genes by their expression level, I was wondering why a gene-length-normalised RNA-seq unit, e.g. TPM or RPKM/FPKM, is not preferred over 'counts' or CPM?
Since 'counts' and CPM are by default not gene-length normalised, wouldn't the gene ranking be affected by gene length rather than reflecting only the expression level of each gene? On closer reading of Hänzelmann et al. (2013), the methods mention that the 'counts' were indeed adjusted for gene length and GC content using the 'cqn' package. However, this point is not really highlighted in the GSVA vignette. Thus, is it still preferred to use 'gene-length normalised counts' over RPKM/FPKM/TPM, and if so, why?
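To make the gene-length concern concrete, here is a minimal Python sketch with made-up numbers. The gene names, lengths and counts are hypothetical, and the helper functions `cpm`, `tpm` and `ranking` are my own illustration, not part of GSVA. It only shows the within-sample ordering I am worried about, not the GSVA algorithm itself.

```python
# Toy data (hypothetical): gene -> (length in kb, raw read count) for one sample.
genes = {
    "geneA": (10.0, 1000),  # long gene, many reads
    "geneB": (1.0, 500),    # short gene, fewer reads
    "geneC": (2.0, 600),
}

counts = {g: v[1] for g, v in genes.items()}
lengths_kb = {g: v[0] for g, v in genes.items()}


def cpm(counts):
    """Counts per million: scales by library size only, not by gene length."""
    total = sum(counts.values())
    return {g: c / total * 1e6 for g, c in counts.items()}


def tpm(counts, lengths_kb):
    """Transcripts per million: divide by gene length first, then rescale."""
    rate = {g: counts[g] / lengths_kb[g] for g in counts}
    total = sum(rate.values())
    return {g: r / total * 1e6 for g, r in rate.items()}


def ranking(values):
    """Gene names ordered from highest to lowest value."""
    return sorted(values, key=values.get, reverse=True)


rank_counts = ranking(counts)
rank_cpm = ranking(cpm(counts))
rank_tpm = ranking(tpm(counts, lengths_kb))

print("counts:", rank_counts)  # ['geneA', 'geneC', 'geneB']
print("CPM:   ", rank_cpm)     # same order as counts
print("TPM:   ", rank_tpm)     # ['geneB', 'geneC', 'geneA']
```

In this toy sample the long gene ranks first by counts and CPM but last by TPM, which is the kind of length-driven difference in ordering I am asking about.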
And finally, why is it recommended to use log-transformed units in all instances (logCOUNTS, logCPM, logTPM or logRPKM)?
Thank you for any comments!