Selecting or not selecting samples before running GSVA
hsiowa2 ▴ 20
Last seen 17 months ago
South Korea


Let's suppose I have a gene expression table consisting of 100 samples.

The samples can be grouped as A, B, C, and D. I want to compare the pathway scores between A, B, and D.

Here are my two options: (i) run GSVA on all 100 samples and compare the GSVA scores between A, B, and D; (ii) extract the A, B, and D samples first, run GSVA on that filtered table, and then compare the GSVA scores.

Would these two give any differences? In addition, since I'm going to run statistical tests (such as ANOVA) when comparing the GSVA scores between A, B, and D, might those two options even affect statistical significance?

As I understand it, unlike ssGSEA, which may not take other samples into account when calculating scores, GSVA does account for the other samples, so the two options may give different results.

Thank you.

Robert Castelo ★ 3.0k
Last seen 5 weeks ago
Barcelona/Universitat Pompeu Fabra


Would these two give any differences?

yes. as described in the paper, "GSVA starts by evaluating whether a gene i is highly or lowly expressed in sample j in the context of the sample population distribution" (please read the rest of the paragraph for full details on this). this means that, depending on the population of samples you provide, the notion of a gene being highly or lowly expressed may change slightly. see Fig. 1 in Zilliox and Irizarry (2007) for more insight into why you may need a population of samples to decide whether a gene is highly or lowly expressed in a high-throughput experiment.
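To make the dependence on the sample population concrete, here is a minimal sketch in plain Python. It is not the GSVA implementation: it swaps in a simple empirical CDF for the kernel-based estimate the method uses, and the expression values and the helper name `ecdf_at` are hypothetical. It only illustrates that the same expression value can look more or less "high" depending on which samples it is compared against.

```python
def ecdf_at(value, population):
    """Fraction of population values <= value (plain empirical CDF, a stand-in)."""
    return sum(x <= value for x in population) / len(population)

# Hypothetical expression values of ONE gene; group labels follow the question.
expr = {
    "A": [5.0, 5.5, 6.0],
    "B": [6.5, 7.0, 7.5],
    "C": [9.0, 9.5, 10.0],  # group C expresses this gene most strongly
    "D": [1.0, 1.5, 2.0],
}

all_samples = [v for group in "ABCD" for v in expr[group]]  # option (i)
abd_samples = [v for group in "ABD" for v in expr[group]]   # option (ii)

value = 7.5  # the highest-expressing sample of group B
stat_all = ecdf_at(value, all_samples)  # judged against A, B, C and D
stat_abd = ecdf_at(value, abd_samples)  # judged against A, B and D only

print(stat_all, stat_abd)
```

With group C included, the sample sits at 0.75 of the distribution; once C is dropped, the same sample becomes the maximum (1.0), so the per-gene statistic that feeds into the pathway score changes.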

might those two options even affect statistical significance?

this is difficult to answer; it depends on the dataset.

my advice is that if you have samples from a common sequencing experiment, put them all in, because the more samples you have, the better informed the expression statistic of the GSVA method will be.

you are right that ssGSEA in principle does not take the population of samples into account. however, the original proposal of the method in the article by Barbie et al. (2009) says the following: "Signature values were normalized using the entire set of 128 lung adenocarcinomas and 17 normal lung specimens.", which means that up to some scale, the values may change from dataset to dataset if you use the normalization step that was originally proposed for the ssGSEA method. in the GSVA package you may switch off that normalization step by setting the argument ssgsea.norm=FALSE and then ssgsea will truly operate using the information of each individual sample exclusively.
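The effect of that normalization step can be sketched in plain Python. This is an illustration under stated assumptions, not the ssGSEA algorithm: `raw_score` is a hypothetical stand-in (mean within-sample rank of the gene-set genes) for the real rank-based enrichment statistic, the data are made up, and the normalization is assumed to divide by the range of scores across the samples supplied.

```python
def raw_score(sample, gene_set):
    """Stand-in per-sample score: mean within-sample rank of the gene-set genes."""
    ordered = sorted(sample, key=sample.get)  # genes from lowest to highest expression
    rank = {gene: i + 1 for i, gene in enumerate(ordered)}
    return sum(rank[g] for g in gene_set) / len(gene_set)

def scores(samples, gene_set, normalize):
    """Score every sample; optionally divide by the range of scores across samples."""
    raw = {name: raw_score(s, gene_set) for name, s in samples.items()}
    if not normalize:
        return raw
    spread = max(raw.values()) - min(raw.values())  # depends on ALL samples given
    return {name: r / spread for name, r in raw.items()}

# Hypothetical expression values for three samples and four genes.
full = {
    "s1": {"g1": 8, "g2": 6, "g3": 2, "g4": 1},
    "s2": {"g1": 1, "g2": 2, "g3": 5, "g4": 7},
    "s3": {"g1": 3, "g2": 9, "g3": 1, "g4": 4},
}
subset = {name: full[name] for name in ("s1", "s3")}  # s2 filtered out beforehand
gene_set = ["g1", "g2"]

raw_full = scores(full, gene_set, normalize=False)
raw_sub = scores(subset, gene_set, normalize=False)
norm_full = scores(full, gene_set, normalize=True)
norm_sub = scores(subset, gene_set, normalize=True)

print(raw_full["s1"], raw_sub["s1"])    # unchanged by filtering
print(norm_full["s1"], norm_sub["s1"])  # changed by filtering
```

Without normalization, the score of s1 is the same whether or not s2 is present, because only s1's own ranks are used. With normalization, the divisor is computed from whichever samples are supplied, so filtering samples beforehand rescales every score.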




Thank you for your kind and detailed answer.


In general, for questions posted at a forum like this one (a similar one could be Biostars), if you are satisfied with an answer, it's a good practice to upvote it and accept it. This not only shows appreciation for the work of developers giving support to their software, but also helps others to more easily identify questions that have been already answered. Thanks!

