Question

Selecting or not selecting samples before running GSVA

hsiowa2 ▴ 20


Hello.

Let's suppose I have a gene expression table consisting of 100 samples.

The samples can be grouped as A, B, C, and D. I want to compare the pathway scores between A, B, and D.

Here are my two options:
(i) run GSVA on all 100 samples, then compare the GSVA scores between A, B, and D;
(ii) extract the A, B, and D samples first, run GSVA on that filtered table, then compare the GSVA scores.

Would these two give any differences? In addition, since I'm going to run statistical tests (such as ANOVA) when comparing the GSVA scores between A, B, and D, might those two options even affect statistical significance?

As I understand it, unlike ssGSEA, which may not consider the other samples when calculating scores, GSVA does take the other samples into account, so the two options may give different results.

Thank you.



GSVA • 2.3k views

updated 4.0 years ago by Robert Castelo ★ 3.4k • written 4.0 years ago by hsiowa2 ▴ 20

score 1 · Answer 1 · 2020-12-22

hi,

Would these two give any differences?

yes, because as described in the paper: "GSVA starts by evaluating whether a gene i is highly or lowly expressed in sample j in the context of the sample population distribution" (please read the rest of the paragraph for full details on this). this means that, depending on the population of samples you provide, the notion of a gene being highly or lowly expressed may change slightly (see Fig. 1 in Zilliox and Irizarry (2007) for more insight into why you may need a population of samples to decide whether a gene is highly or lowly expressed in a high-throughput experiment).
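to make the dependence on the sample population concrete, here is a toy sketch in Python. it is not the GSVA implementation: it uses a plain empirical CDF as a stand-in for the gene-level expression statistic, and the expression values and the helper name `ecdf_value` are made up for illustration.

```python
from bisect import bisect_right

def ecdf_value(x, population):
    """Fraction of population values <= x (plain empirical CDF)."""
    ordered = sorted(population)
    return bisect_right(ordered, x) / len(ordered)

# Hypothetical expression values of a single gene, three samples per group.
expr = {
    "A": [2.0, 2.5, 3.0],
    "B": [4.0, 4.5, 5.0],
    "C": [8.0, 8.5, 9.0],
    "D": [5.5, 6.0, 6.5],
}

# Option (i): the population is all samples. Option (ii): only A, B and D.
all_samples = [v for group in "ABCD" for v in expr[group]]
abd_samples = [v for group in "ABD" for v in expr[group]]

x = 6.0  # the same sample from group D, evaluated against both populations
stat_full = ecdf_value(x, all_samples)
stat_subset = ecdf_value(x, abd_samples)

print(f"all samples: {stat_full:.3f}  A/B/D only: {stat_subset:.3f}")
```

the same value of 6.0 looks more "highly expressed" once the high-expressing group C is left out, which is why the two options can lead to different scores.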

might those two options even affect statistical significance?

this is difficult to answer; it depends on the dataset.

my advice is that if you have samples from a common sequencing experiment, put them all in, because the more samples you have, the better informed the expression statistic of the GSVA method will be.

you are right that ssGSEA in principle does not take the population of samples into account. however, the original proposal of the method in the article by Barbie et al. (2009) says the following: "Signature values were normalized using the entire set of 128 lung adenocarcinomas and 17 normal lung specimens.", which means that up to some scale, the values may change from dataset to dataset if you use the normalization step that was originally proposed for the ssGSEA method. in the GSVA package you may switch off that normalization step by setting the argument ssgsea.norm=FALSE and then ssgsea will truly operate using the information of each individual sample exclusively.
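a small Python sketch of why such a normalization step couples the samples together. it assumes that the normalization amounts to dividing every score by the range (maximum minus minimum) of the scores over the whole dataset; please treat that detail, the helper name `range_normalize` and all the numbers below as illustrative assumptions rather than the package's actual code.

```python
def range_normalize(scores):
    """Divide every score by (max - min) taken over all samples and gene sets."""
    values = [s for per_sample in scores.values() for s in per_sample.values()]
    spread = max(values) - min(values)
    return {
        sample: {gene_set: s / spread for gene_set, s in per_sample.items()}
        for sample, per_sample in scores.items()
    }

# Hypothetical raw per-sample scores; each one depends only on its own sample.
raw = {
    "A1": {"pathway1": 0.10, "pathway2": -0.20},
    "B1": {"pathway1": 0.30, "pathway2": 0.05},
    "C1": {"pathway1": 0.90, "pathway2": -0.60},
    "D1": {"pathway1": 0.20, "pathway2": 0.15},
}

# Dropping group C leaves the raw scores of A1, B1 and D1 untouched ...
subset = {name: s for name, s in raw.items() if not name.startswith("C")}

# ... but the normalized scores change, because the global range changes.
norm_full = range_normalize(raw)
norm_subset = range_normalize(subset)

print(norm_full["A1"]["pathway1"], norm_subset["A1"]["pathway1"])
```

without the normalization the score of A1 is the same in both cases; with it, the score is rescaled by a factor that depends on which samples are in the dataset.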

cheers,

robert.