Question: GSVA: enrichment score bounds and transformations
12 months ago by SB (United States)
SB wrote:

Dear Bioconductors, 

It looks like the GSVA enrichment scores are bounded between -1 and 1. Is this true? If so, is this ensured in step 3?

Lastly, although I set the scores to be computed as the difference between the largest positive and negative deviations, the output was not normally distributed (potentially because I have a relatively small number of samples). I was therefore planning to transform the values using the Fisher transformation. Does this seem like an appropriate follow-up approach to make the scores more normally distributed so that I can perform a two-way ANOVA, or would it over-transform the scores to the point that they are less meaningful representations of the data? On a side note, this transformation only worked for one of my expression sets. Are there any other transformations you'd recommend for enrichment scores?

I appreciate any advice you can offer.

Thank you,


Answer: GSVA: enrichment score bounds and enrichment score transformations
12 months ago by Robert Castelo (2.3k), Spain/Barcelona/Universitat Pompeu Fabra
Robert Castelo wrote:

Hi Sarah,

GSVA enrichment scores, as defined by Eq. (5) in the GSVA article, are bounded between -1 and 1 because they arise from the difference between two empirical cumulative distribution functions (ECDFs), each of which is itself bounded between 0 and 1.
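To make the bound concrete, here is a minimal Python sketch of a Kolmogorov-Smirnov-like random walk. It is a simplified illustration, not the GSVA package code: the function names (`random_walk`, `max_deviation_score`, `difference_score`) and the optional `weights` argument are hypothetical. Each running sum climbs from 0 to 1, so their difference, and any score read off it, stays within [-1, 1].

```python
def random_walk(ranked_genes, gene_set, weights=None):
    """Running difference between two step functions that each rise from 0 to 1.

    Assumes gene_set contains at least one, but not all, of ranked_genes,
    and that weights (if given) are non-negative with a positive total
    over the genes in the set.
    """
    in_set = [g in gene_set for g in ranked_genes]
    if weights is None:
        weights = [1.0] * len(ranked_genes)
    hit_total = sum(w for w, hit in zip(weights, in_set) if hit)
    miss_total = sum(1 for hit in in_set if not hit)

    hit_cdf = 0.0   # step function over genes inside the set
    miss_cdf = 0.0  # step function over genes outside the set
    walk = []
    for w, hit in zip(weights, in_set):
        if hit:
            hit_cdf += w / hit_total
        else:
            miss_cdf += 1.0 / miss_total
        walk.append(hit_cdf - miss_cdf)
    return walk


def max_deviation_score(walk):
    """Signed value of the walk at its largest absolute deviation from zero."""
    return max(walk, key=abs)


def difference_score(walk):
    """Magnitude of the largest positive deviation minus that of the largest negative one."""
    largest_positive = max(0.0, max(walk))
    largest_negative = min(0.0, min(walk))
    return largest_positive + largest_negative


ranked = ["a", "b", "c", "d"]
top = random_walk(ranked, {"a", "b"})     # set genes ranked first
bottom = random_walk(ranked, {"c", "d"})  # set genes ranked last
print(difference_score(top), difference_score(bottom))
```

The two extreme cases, where all genes of the set sit at the top or at the bottom of the ranking, are the ones that reach the bounds.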

You mention that your GSVA scores are not normally distributed because you have relatively few samples. Please keep in mind that in the discussion of the GSVA article, we specifically give the following advice: "The user should also be aware that the non-parametric density estimation within the GSVA algorithm requires a sufficient number of observations which, according to our analysis of statistical power in Figure 2, should be larger than n=10".

On the other hand, if you have few samples, it does not matter much whether the numbers are GSVA scores or anything else: it will be difficult to have them normally distributed. The problem with doing a two-way ANOVA on few observations is that the estimates of variability are going to be very unstable, and for that reason people use methods such as limma, which are specifically tailored for limited replication; see this other answer, "Are published RNA seq data analyses often wrong in calculating p-values and FDR?", for an example. I'm not familiar with any transformation that will turn a few data points into normally distributed data, and I can't imagine how that could lead to meaningful results.




Thank you for the response. If I have a data set with a total of 16 samples (4 per treatment group), would you say that the non-parametric density estimation would not produce reliable results? From the article, I was unsure whether "observations" means the samples per treatment group or the total number of samples in the data set.

written 11 months ago by SB (modified 6 months ago)

Whether results are reliable depends on many factors, most of them probably not related to GSVA. The recommendation about sample size refers to a minimum for the total number of samples in the data set, because density estimation is indeed performed across all samples, regardless of sample type/class.
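As a rough illustration of why the total count is what matters, here is a minimal Python sketch of a Gaussian-kernel estimate of one gene's cumulative distribution function. It is only a sketch under stated assumptions, not the GSVA implementation: the names `normal_cdf` and `kernel_cdf` are hypothetical, and the default bandwidth of a quarter of the sample standard deviation is assumed here for illustration. The point is that the estimate for each sample averages over all n samples and never sees a group label.

```python
import math
import statistics


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def kernel_cdf(values, bandwidth=None):
    """Gaussian-kernel CDF estimate of one gene, evaluated at each sample.

    values: expression of a single gene across ALL samples (must not be constant).
    bandwidth: kernel width; the default below is an assumption for illustration.
    """
    n = len(values)
    if bandwidth is None:
        bandwidth = statistics.stdev(values) / 4.0
    return [
        sum(normal_cdf((x - y) / bandwidth) for y in values) / n
        for x in values
    ]


# 16 samples, e.g. 4 per treatment group; the group labels are never used.
expression = [5.1, 4.8, 5.3, 5.0, 6.2, 6.0, 6.4, 6.1,
              4.2, 4.5, 4.1, 4.4, 7.0, 6.8, 7.2, 6.9]
print([round(p, 2) for p in kernel_cdf(expression)])
```

With 4 samples per group and 4 groups, the estimate is built from all 16 values, which is the number the n=10 recommendation should be compared against.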

written 11 months ago by Robert Castelo (2.3k)