Question: Single Sample GSEA (ssGSEA) and dynamic range of expression
1
gravatar for sina.nassiri
22 months ago by
sina.nassiri70
Switzerland/Lausanne
sina.nassiri70 wrote:

Dear ssGSEA users,

In ssGSEA, gene expression values for a given sample are rank-normalized, and an enrichment score is produced using the Empirical Cumulative Distribution Functions (ECDF) of the genes in the signature and the remaining genes [Barbie et al., 2009]. The part that I'm having a hard time wrapping my head around is, how can we replace the genes by their ranks according to the their absolute expression, without accounting for inherent differences in dynamic range of expression?

Assume we have 4 samples and 3 genes:

        s1(ctrl)      s2(ctrl)        s3(tx)       s4(tx)

-------------------------------------------------------------

A:      2.6            2.8             2.1            2.2

B:      4.5            4.8             3.2             4  

C:      5.1            6.2               8             7.5

Each gene has a different dynamic range, for example: gene A varies between [2,3], gene B varies between [3,5]  and gene C varies between [5,8]; See below for a real-life example of 20 randomly selected genes in a microarray data set with 40 arrays for example.

Now in the example above, if for each sample we convert expression values into ranks, we'll have:

        s1(ctrl)      s2(ctrl)        s3(tx)       s4(tx)

-------------------------------------------------------------

A:        3              3                3               3

B:        2              2                2               2  

C:        1              1                1               1

Let's further assume that our gene set of interest is {A,B}. As demonstrated in this example, even though, A and B are over-expressed in ctrl samples, i.e. s1 and s2 are enriched for the gene set of interest, conversion to rank makes gene C with higher dynamic range the 1st rank, gene B the 2nd rank, and gene C the 3rd. Therefore, variation in ranges of expression appears to mask gene set enrichment.

I'm amazed how well ssGSEA performs in practice (see here for example for an extensive numerical and experimental validation) considering this limitation. Looking forward to your input!

gsva piano ssgsea egsea • 5.6k views
ADD COMMENTlink modified 6 weeks ago by gabriel.rosser20 • written 22 months ago by sina.nassiri70

I think you have a point here.

But then I wonder, what happens if we normalize gene expressions (i.e. zscore)? Therefore, all genes will have zero mean and unit standard deviation. Now each value represents how far that gene (in a sample/patient) is from the population mean of that particular gene. I can not see the mentioned problem with such a data set. Am I missing something Sina?

 

Edit: Found it: It seems that the authors do normalize their samples before performing the ranking. Please take a look at:

https://bioconductor.org/packages/release/bioc/manuals/GSVA/man/GSVA.pdf (page 6)

Quoting:

runs the SSGSEA method from Barbie et al. (2009) normalizing the scores by the absolute difference between the minimum and the maximum, as described in their paper.

I hope this helps.

 

ADD REPLYlink modified 21 months ago • written 21 months ago by aminbys0

Sorry for the super late reply. My question was directly about the ssGSEA algorithm of Barbie et al. The first normalization that you're referring to, is the normalization prior to ranking via ecdf as proposed by Hänzelmann et al and implemented in the GSVA package. This normalization does address my concern, but only applies to GSVA algorithm, i.e. when method="gsva". When method="ssgsea", this step is skipped. The second normalization that you're referring to, is the normalization proposed by Barbie et al in the original ssGSEA paper, also implemented in GSVA. However, normalizing enrichment scores by their range (i.e. max - min) is performed after the scores are already computed, and therefore does not address the potential issue with dynamic range that was raised in my question.

ADD REPLYlink written 17 months ago by sina.nassiri70

in this situation, why would you use ssGSEA and not GSEA? ssGSEA is going to treat each sample/colum independently

ADD REPLYlink written 9 months ago by lm6870

GSEA requires two phenotypes to be specified (e.g. healthy vs disease). That isn't necessarily the case in this example - the OP may be interested in relative enrichment of a given signature across a patient cohort, for example. I think the suggestion that there are control and case samples is purely given to help us understand the situation - they aren't necessarily labels we know a priori.

ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by gabriel.rosser20
Answer: Single Sample GSEA (ssGSEA) and dynamic range of expression
0
gravatar for gabriel.rosser
6 weeks ago by
gabriel.rosser20 wrote:

Returning to this old question because, IMO, it's still relevant!

This is a very interesting point and example. I wonder if the reason ssGSEA works well (without the additional normalisation discussed in other comments) is that in reality there are a very large number of genes that won't have large variations across the samples. These will act as 'spacers' to increase the difference in ranks. Therefore, A and B will have ranks in the 1000s in the control samples and 100s in the tx samples and the enrichment will become apparent?

The issue you raise would then be expected to become more substantial as the size of the signature increases relative to the total number of genes. However, I'm not aware that ssGSEA was ever designed for this regime - and would imagine that the power to assess enrichment would be low in this case?

ADD COMMENTlink written 6 weeks ago by gabriel.rosser20
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 397 users visited in the last hour