For GSVA scoring on RNA-seq data, the authors recommend using 'counts' as input data (with kcdf="Poisson"), but also briefly mention the option of using logCPM, logTPM or logRPKM (with kcdf="Gaussian") as input. Since the first step in the GSVA scoring algorithm is to rank the genes by their expression level, I was wondering why an RNA-seq unit that is gene-length normalised, e.g. TPM or RPKM/FPKM, is not preferred over 'counts' or CPM.
Since 'counts' and CPM are not gene-length normalised by default, wouldn't the gene ranking be affected by gene length rather than reflecting only the expression level of a gene? Upon closer reading of Haenzelmann et al. 2013, the methods mention that 'counts' were indeed adjusted for gene length and GC content, using the 'cqn' package. However, this point is not really highlighted in the GSVA vignette. Thus, is it still preferred to use 'gene-length normalised counts' over RPKM/FPKM/TPM, and why?
And finally, why is it recommended to use log-transformed units in all instances (logCOUNTS, logCPM, logTPM or logRPKM)?
The GSVA documentation does not make any recommendation as to which gene expression unit you should use. It just tries to address the question of what the value of the argument 'kcdf' should be. More concretely, the help page says:
By default, ‘kcdf="Gaussian"’ which is suitable when input expression values are continuous, such as microarray fluorescent units in logarithmic scale, RNA-seq log-CPMs, log-RPKMs or log-TPMs. When input expression values are integer counts, such as those derived from RNA-seq experiments, then this argument should be set to ‘kcdf="Poisson"’.
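To make the rule in that help-page text concrete, here is a minimal sketch of the decision it describes. It is written in Python purely for illustration; `choose_kcdf` is a hypothetical helper and not part of GSVA, and it only inspects whether the values look like integer counts.

```python
def choose_kcdf(values):
    """Suggest a 'kcdf' setting from the kind of expression values supplied.

    Hypothetical helper (not a GSVA function): returns "Poisson" when every
    value is a non-negative whole number, as with raw RNA-seq counts, and
    "Gaussian" otherwise (continuous values such as log-CPMs or log-TPMs).
    """
    values = list(values)
    if not values:
        raise ValueError("no expression values supplied")
    looks_like_counts = all(float(v).is_integer() and v >= 0 for v in values)
    return "Poisson" if looks_like_counts else "Gaussian"


print(choose_kcdf([0, 12, 7, 153]))      # raw counts
print(choose_kcdf([0.0, 3.58, 2.81]))    # log-scale continuous values
```

Note that a check like this cannot tell rounded continuous values from real counts, so it is only a sanity check on what you already know about your input.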
The documentation mentions these expression units as examples of continuous measures of expression, as opposed to integer counts. You can use unlogged quantities with method='gsva' if you want, since the first step in the GSVA algorithm, which calculates expression-level statistics, is non-parametric.
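As a rough illustration of why a non-parametric first step makes the log transformation less critical, the sketch below compares a purely rank-based statistic on unlogged and logged values. This is a simplification under stated assumptions: it uses a plain empirical CDF of one gene across samples, not GSVA's kernel-based estimator, so it will not reproduce GSVA's actual numbers; the TPM values and the function name `empirical_cdf` are made up for the example.

```python
import math


def empirical_cdf(profile):
    """For each sample, the fraction of samples whose value is <= its own."""
    n = len(profile)
    return [sum(other <= x for other in profile) / n for x in profile]


# Hypothetical TPM values of a single gene across five samples.
tpm = [0.5, 12.0, 3.2, 150.0, 3.2]
log_tpm = [math.log2(x + 1) for x in tpm]

cdf_raw = empirical_cdf(tpm)
cdf_log = empirical_cdf(log_tpm)

# The log is strictly increasing, so the ordering of samples is preserved.
print(cdf_raw == cdf_log)
```

Any strictly increasing transformation leaves this rank-based statistic unchanged; how closely that carries over to GSVA's kernel estimates is not something this sketch demonstrates.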
Unrelated to GSVA, note that the use of RPKMs/FPKMs is discouraged in general; you can find a recent discussion in this thread.