Hi,
If your expression data are continuous, you should be fine with the default settings of the `gsva()` function. If you want to understand the options a bit better, check at least the manual page for `gsva()`, which says the following about the `kcdf` argument:
kcdf: Character string denoting the kernel to use during the
non-parametric estimation of the cumulative distribution
function of expression levels across samples when
‘method="gsva"’. By default, ‘kcdf="Gaussian"’ which is
suitable when input expression values are continuous, such as
microarray fluorescent units in logarithmic scale, RNA-seq
log-CPMs, log-RPKMs or log-TPMs. When input expression
values are integer counts, such as those derived from RNA-seq
experiments, then this argument should be set to
‘kcdf="Poisson"’.
This means that `kcdf` is only relevant when `method="gsva"`, which is the default. The help page says the following about the `method` argument:
method: Method to employ in the estimation of gene-set enrichment
scores per sample. By default this is set to ‘gsva’
(Hänzelmann et al, 2013) and other options are ‘ssgsea’
(Barbie et al, 2009), ‘zscore’ (Lee et al, 2008) or ‘plage’
(Tomfohr et al, 2005). The latter two standardize first [...]
The cited references are listed at the end of the help page, so you can check them to decide which method you think is most appropriate for your data. The GSVA paper contains a comparison between them and, unsurprisingly, recommends `method="gsva"`. However, you can try each of them on your data and decide for yourself which one gives the most sensible results. In general terms, PLAGE and z-score are parametric and should perform well with close-to-Gaussian expression profiles, while ssGSEA and GSVA are non-parametric and more robust to departures from Gaussianity in gene expression data.
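
To make the advice above concrete, here is a minimal sketch of how one might pick `kcdf` from the data type and run all four methods for comparison. It assumes the interface described in the quoted help page, where `gsva()` takes `method` and `kcdf` directly; more recent GSVA releases use parameter objects instead, so check your installed version. `choose_kcdf()` and `run_all_methods()` are hypothetical helpers written for this illustration, not part of GSVA, and the toy matrices are made up.

```r
# Hypothetical helper (not part of GSVA): suggest a kcdf value.
# Non-negative integer values are treated as counts ("Poisson");
# anything else is treated as continuous ("Gaussian").
choose_kcdf <- function(expr) {
  is_count <- all(expr >= 0, na.rm = TRUE) &&
    all(expr == round(expr), na.rm = TRUE)
  if (is_count) "Poisson" else "Gaussian"
}

# Hypothetical wrapper: run each method on the same data so the
# resulting score matrices can be compared side by side.
# Assumes the gsva(expr, gene_sets, method =, kcdf =) interface from the
# quoted help page; per that page, kcdf only matters for method = "gsva".
run_all_methods <- function(expr, gene_sets) {
  methods <- c("gsva", "ssgsea", "zscore", "plage")
  scores <- lapply(methods, function(m) {
    GSVA::gsva(expr, gene_sets, method = m, kcdf = choose_kcdf(expr))
  })
  names(scores) <- methods
  scores
}

# Toy data (made up): raw counts and a log-transformed version of them.
counts <- matrix(c(0, 3, 12, 7, 5, 9), nrow = 3)
logcpm <- log2(counts + 1)

choose_kcdf(counts)  # "Poisson"
choose_kcdf(logcpm)  # "Gaussian"
```

With a real expression matrix `expr` (genes in rows, samples in columns) and a named list of gene sets `gene_sets`, calling `run_all_methods(expr, gene_sets)` would return one score matrix per method, which you can then inspect to see which behaves most sensibly on your data.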