Dear ssGSEA users,
In ssGSEA, gene expression values for a given sample are rank-normalized, and an enrichment score is produced using the Empirical Cumulative Distribution Functions (ECDF) of the genes in the signature and the remaining genes [Barbie et al., 2009]. The part that I'm having a hard time wrapping my head around is, how can we replace the genes by their ranks according to the their absolute expression, without accounting for inherent differences in dynamic range of expression?
Assume we have 4 samples and 3 genes:
s1(ctrl) s2(ctrl) s3(tx) s4(tx)
-------------------------------------------------------------
A: 2.6 2.8 2.1 2.2
B: 4.5 4.8 3.2 4
C: 5.1 6.2 8 7.5
Each gene has a different dynamic range, for example: gene A varies between [2,3], gene B varies between [3,5] and gene C varies between [5,8]; See below for a real-life example of 20 randomly selected genes in a microarray data set with 40 arrays for example.
Now in the example above, if for each sample we convert expression values into ranks, we'll have:
s1(ctrl) s2(ctrl) s3(tx) s4(tx)
-------------------------------------------------------------
A: 3 3 3 3
B: 2 2 2 2
C: 1 1 1 1
Let's further assume that our gene set of interest is {A,B}. As demonstrated in this example, even though, A and B are over-expressed in ctrl samples, i.e. s1 and s2 are enriched for the gene set of interest, conversion to rank makes gene C with higher dynamic range the 1st rank, gene B the 2nd rank, and gene C the 3rd. Therefore, variation in ranges of expression appears to mask gene set enrichment.
I'm amazed how well ssGSEA performs in practice (see here for example for an extensive numerical and experimental validation) considering this limitation. Looking forward to your input!
I think you have a point here.
But then I wonder, what happens if we normalize gene expressions (i.e. zscore)? Therefore, all genes will have zero mean and unit standard deviation. Now each value represents how far that gene (in a sample/patient) is from the population mean of that particular gene. I can not see the mentioned problem with such a data set. Am I missing something Sina?
Edit: Found it: It seems that the authors do normalize their samples before performing the ranking. Please take a look at:
https://bioconductor.org/packages/release/bioc/manuals/GSVA/man/GSVA.pdf (page 6)
Quoting:
runs the SSGSEA method from Barbie et al. (2009) normalizing the scores by the absolute difference between the minimum and the maximum, as described in their paper.
I hope this helps.
Sorry for the super late reply. My question was directly about the ssGSEA algorithm of Barbie et al. The first normalization that you're referring to, is the normalization prior to ranking via ecdf as proposed by Hänzelmann et al and implemented in the GSVA package. This normalization does address my concern, but only applies to GSVA algorithm, i.e. when
method="gsva"
. Whenmethod="ssgsea"
, this step is skipped. The second normalization that you're referring to, is the normalization proposed by Barbie et al in the original ssGSEA paper, also implemented in GSVA. However, normalizing enrichment scores by their range (i.e. max - min) is performed after the scores are already computed, and therefore does not address the potential issue with dynamic range that was raised in my question.Replying after ~6 years, sorry!
Thanks for the clarification @sina.nassiri :)
I might be completely off now but after checking the Barbie et al. 2009 paper, it seems that they initially normalize their samples, then rank the genes, and then calculate the sample-specific enrichment using the ranked genes (see the "RIGER analysis" in the "Methods" section of the Barbie et al. 2009 paper). In that case, I guess the problem is resolved. Do you agree?
I feel this concern is valid when there is only one sample to study - there is no reference to compare. But when there are groups of samples, one can focus on changes in enrichment score; no matter where in the rank list these gene set members are located, if they tend to be upregulated, their ranks would be raised and the enriment score would be larger.
in this situation, why would you use ssGSEA and not GSEA? ssGSEA is going to treat each sample/colum independently
GSEA requires two phenotypes to be specified (e.g. healthy vs disease). That isn't necessarily the case in this example - the OP may be interested in relative enrichment of a given signature across a patient cohort, for example. I think the suggestion that there are control and case samples is purely given to help us understand the situation - they aren't necessarily labels we know a priori.