For GSVA scoring on RNA-seq data, the authors recommend using 'counts' as input data (with kcdf="Poisson"), but also briefly mention the option to use logCPM, logTPM or logRPKM (with kcdf="Gaussian") as input. Since the first step in the GSVA scoring algorithm is to rank the genes by their expression level, I was wondering why an RNA-seq unit that is gene-length normalised, e.g. TPM or RPKM/FPKM, is not preferred over 'counts' or CPM?
Since 'counts' and CPM are by default not gene-length normalised, wouldn't the gene ranking be affected by gene length rather than only reflect the expression level of a gene? Upon closer reading of Hänzelmann et al. 2013, the methods mention that 'counts' were indeed adjusted for gene length and GC content using the 'cqn' package. However, this point is not really highlighted in the GSVA vignette. So is it still preferred to use 'gene-length normalised counts' over RPKM/FPKM/TPM, and why?
And finally, why is it recommended to use log-transformed units in all instances (logCOUNTS, logCPM, logTPM or logRPKM)?
The GSVA documentation does not make any recommendation about which gene expression unit you should be using; it only addresses the question of what value the argument 'kcdf' should take. More concretely, the help page says:
By default, ‘kcdf="Gaussian"’ which is suitable when input expression values are continuous, such as microarray fluorescent units in logarithmic scale, RNA-seq log-CPMs, log-RPKMs or log-TPMs. When input expression values are integer counts, such as those derived from RNA-seq experiments, then this argument should be set to ‘kcdf="Poisson"’.
The documentation mentions these expression units as examples of continuous measures of expression, as opposed to integer counts. You can use unlogged quantities with method='gsva' if you want, since the first step of the GSVA algorithm, which calculates expression-level statistics, is non-parametric.
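For what it's worth, here is a minimal sketch of the two call patterns with the function-style gsva() interface; the objects 'counts' and 'geneSets' are placeholders for your own count matrix (genes x samples) and list of gene sets, and the log-CPM conversion via edgeR is just one possible choice:

library(GSVA)
library(edgeR)

## integer RNA-seq counts: use the Poisson kernel
es_counts <- gsva(counts, geneSets, kcdf = "Poisson")

## continuous values such as log-CPMs: keep the default Gaussian kernel
logcpm <- cpm(counts, log = TRUE)
es_logcpm <- gsva(logcpm, geneSets, kcdf = "Gaussian")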
Unrelated to GSVA, note that the use of RPKMs/FPKMs is discouraged in general; you can find a recent discussion in this thread.
Thanks for the reply and apologies for the wrong assumption that 'counts' was the recommended input unit for RNA-seq data - it was based on the fact that 'counts' were used for all RNA-seq examples in the package vignette and the original publication. However, the 'counts' used in those examples were always normalised for gene length and GC content using the 'cqn' package, so I assume it is indeed deemed important to take gene length into account, as it will affect the ranking of the genes? I still need to do some more research on how the 'cqn' gene-length normalised counts differ from RPKMs/FPKMs, but if the use of RPKMs/FPKMs is now being discouraged in general, what is the alternative gene-length normalised unit that is now recommended 'in general'?
We used conditional quantile normalization with the cqn package because, back then, it seemed to us the most flexible and advanced normalization method for RNA-seq data, but we did not intend to give specific advice to users in this respect. Whether you need to adjust for GC content and/or gene length when normalizing RNA-seq data depends on whether your data set shows such sample-specific effects. I recommend you look at the cqn vignette and, more specifically, the cqnplot() function to diagnose possible GC-content or gene-length sample-specific effects, and then make a more informed decision. In the tweeDEseq paper I wrote with other colleagues, we did compare cqn with TMM and found, in that particular comparison now more than 5 years old, that cqn with adjustment for gene length and GC bias performed a bit better than TMM (Fig. 10). As for what gene-length normalized expression units are recommended "in general", I would say this depends on the application. For gene-level differential expression I think it is recommended to start from the raw counts, because gene-length corrections tend to distort the size of the original count, which is important for assessing its uncertainty and variability. Probably this recommendation is also valid for GSVA because it relies on expression-level rankings, although I do not know of any benchmarking on this specific question.
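As a rough illustration of that diagnostic step (the objects 'counts', 'gc' and 'geneLength' are placeholders for your own count matrix and per-gene GC-content and length annotations), something along these lines should work with the cqn package:

library(cqn)

## fit conditional quantile normalization with GC content and gene length as covariates
cqn_fit <- cqn(counts, x = gc, lengths = geneLength)

## diagnose sample-specific effects: n = 1 plots the GC-content fit, n = 2 the gene-length fit
cqnplot(cqn_fit, n = 1, xlab = "GC content")
cqnplot(cqn_fit, n = 2, xlab = "gene length")

## CQN-normalized expression values on the log2 scale
normExprs <- cqn_fit$y + cqn_fit$offset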
Thank you for the very helpful comments!