Single Sample GSEA (ssGSEA) and dynamic range of expression
1
3
Entering edit mode
sina.nassiri ▴ 110
@sinanassiri-10062
Last seen 6 months ago
Switzerland/Lausanne

Dear ssGSEA users,

In ssGSEA, gene expression values for a given sample are rank-normalized, and an enrichment score is produced using the Empirical Cumulative Distribution Functions (ECDF) of the genes in the signature and the remaining genes [Barbie et al., 2009]. The part that I'm having a hard time wrapping my head around is, how can we replace the genes by their ranks according to the their absolute expression, without accounting for inherent differences in dynamic range of expression?

Assume we have 4 samples and 3 genes:

s1(ctrl)      s2(ctrl)        s3(tx)       s4(tx)

-------------------------------------------------------------

A:      2.6            2.8             2.1            2.2

B:      4.5            4.8             3.2             4

C:      5.1            6.2               8             7.5

Each gene has a different dynamic range, for example: gene A varies between [2,3], gene B varies between [3,5]  and gene C varies between [5,8]; See below for a real-life example of 20 randomly selected genes in a microarray data set with 40 arrays for example.

Now in the example above, if for each sample we convert expression values into ranks, we'll have:

s1(ctrl)      s2(ctrl)        s3(tx)       s4(tx)

-------------------------------------------------------------

A:        3              3                3               3

B:        2              2                2               2

C:        1              1                1               1

Let's further assume that our gene set of interest is {A,B}. As demonstrated in this example, even though, A and B are over-expressed in ctrl samples, i.e. s1 and s2 are enriched for the gene set of interest, conversion to rank makes gene C with higher dynamic range the 1st rank, gene B the 2nd rank, and gene C the 3rd. Therefore, variation in ranges of expression appears to mask gene set enrichment.

I'm amazed how well ssGSEA performs in practice (see here for example for an extensive numerical and experimental validation) considering this limitation. Looking forward to your input!

ssgsea gsva piano egsea • 15k views
1
Entering edit mode

I think you have a point here.

But then I wonder, what happens if we normalize gene expressions (i.e. zscore)? Therefore, all genes will have zero mean and unit standard deviation. Now each value represents how far that gene (in a sample/patient) is from the population mean of that particular gene. I can not see the mentioned problem with such a data set. Am I missing something Sina?

Edit: Found it: It seems that the authors do normalize their samples before performing the ranking. Please take a look at:

Quoting:

runs the SSGSEA method from Barbie et al. (2009) normalizing the scores by the absolute difference between the minimum and the maximum, as described in their paper.

I hope this helps.

2
Entering edit mode

Sorry for the super late reply. My question was directly about the ssGSEA algorithm of Barbie et al. The first normalization that you're referring to, is the normalization prior to ranking via ecdf as proposed by Hänzelmann et al and implemented in the GSVA package. This normalization does address my concern, but only applies to GSVA algorithm, i.e. when method="gsva". When method="ssgsea", this step is skipped. The second normalization that you're referring to, is the normalization proposed by Barbie et al in the original ssGSEA paper, also implemented in GSVA. However, normalizing enrichment scores by their range (i.e. max - min) is performed after the scores are already computed, and therefore does not address the potential issue with dynamic range that was raised in my question.

1
Entering edit mode

I feel this concern is valid when there is only one sample to study - there is no reference to compare. But when there are groups of samples, one can focus on changes in enrichment score; no matter where in the rank list these gene set members are located, if they tend to be upregulated, their ranks would be raised and the enriment score would be larger.

0
Entering edit mode

in this situation, why would you use ssGSEA and not GSEA? ssGSEA is going to treat each sample/colum independently

0
Entering edit mode

GSEA requires two phenotypes to be specified (e.g. healthy vs disease). That isn't necessarily the case in this example - the OP may be interested in relative enrichment of a given signature across a patient cohort, for example. I think the suggestion that there are control and case samples is purely given to help us understand the situation - they aren't necessarily labels we know a priori.

0
Entering edit mode
@gabrielrosser-12715
Last seen 2.8 years ago

Returning to this old question because, IMO, it's still relevant!

This is a very interesting point and example. I wonder if the reason ssGSEA works well (without the additional normalisation discussed in other comments) is that in reality there are a very large number of genes that won't have large variations across the samples. These will act as 'spacers' to increase the difference in ranks. Therefore, A and B will have ranks in the 1000s in the control samples and 100s in the tx samples and the enrichment will become apparent?

The issue you raise would then be expected to become more substantial as the size of the signature increases relative to the total number of genes. However, I'm not aware that ssGSEA was ever designed for this regime - and would imagine that the power to assess enrichment would be low in this case?