Search
Question: GSEA-like enrichment score calculation
1
2.3 years ago by
Matthias Z.20
University Medical Center of Münster, Germany
Matthias Z.20 wrote:

Hello everybody,

I would like to calculate and plot the enrichment score of a selection from a ranked list, as many readers are familar with the gene set enrichment plots produced by the Broad Institute's GSEA implementation.

While the basic principle is clear to me, I failed to understand how the magnitude is determined, by which the running sum statistic is increased or decreased. The paper and the FAQ only say "the magnitude of the increment depends on the correlation of the gene with the phenotype" and a otherwise very helpful third-party site also doesn't explain how it is calculated.

Please find a working example code to illustrate my current approach below in case you would like to try it out for yourself. However to help me an explanation or formula to calculate the enrichment score would already suffice, since my current result doesn't look similar to the enrichment score of GSEA plots at all. Thanks a lot for your help.

Matthias

testvalues <- rweibull(1e4,shape=1,scale=2) + rnorm(1e4)
testdata <- data.frame(
rank = rank(testvalues,ties.method = "random"),
value = testvalues,
random_set = sample(
c(TRUE,FALSE),size = 1e4,replace = TRUE,prob = c(0.02,0.98)
),
enriched_set = (  # place random TRUES, but only in the upper 50% range
rank(testvalues,ties.method = "random") > 5e3 &
sample(
c(TRUE,FALSE),size = 1e4,replace = TRUE,prob = c(0.05,0.95)
)
)
)

# order dataset and reverse ranks (so biggest value is ranked first)
testdata <- testdata[order(testdata[,"rank"]),]
testdata[,"rank"] <- rev(testdata[,"rank"])
testdata <- rev(testdata)

# fruitless attempt to calculate someting like an enrichment score
testdata[,c("ES_random_set","ES_enriched_set")] <-
apply(testdata[,c("random_set","enriched_set")],2,function(x) {
enscore <- as.numeric(x)
# normalize magnitutde for selection size
enscore[enscore == 0] <- (-1 / (length(x) / sum(x) - 1))
return(round(cumsum(enscore),3)*-1)
})

### plot the wannabe enrichment score

plot_es <- function(plotdata,set,es){
library("ggplot2")
library("gridExtra")

curve <- ggplot(plotdata,aes_string(x="rank", y="value")) +
geom_line() + geom_point(data=plotdata[plotdata[,set],],color="red",size=5,shape=124)
esplot <- ggplot(plotdata,aes_string(x="rank", y=es)) + geom_line()
return(grid.arrange(curve,esplot,nrow=2,heights=c(3,1)))
}

plot_es(testdata,"random_set","ES_random_set")
plot_es(testdata,"enriched_set","ES_enriched_set")

modified 2.3 years ago by Gordon Smyth35k • written 2.3 years ago by Matthias Z.20
4
2.3 years ago by
Gordon Smyth35k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth35k wrote:

This is GSEA-like rather than GSEA:

library(limma)
barcodeplot(testdata$value, testdata$random_set)
barcodeplot(testdata$value, testdata$enriched_set)

Hello Gordon,

Thanks a lot for your answer, this indeed looks tremendously helpful. It seems I was trying to reinvent the wheel here.

I will probably be able to use the barcodeplot right away for my project or can at least dissect the function. Anyway this should clearly solve my problems.

Best regards

Matthias

3
2.3 years ago by
Simon Anders3.5k
Zentrum für Molekularbiologie, Universität Heidelberg
Simon Anders3.5k wrote:

My understanding is that the increments and decrements are chosen such that you end up where you started, at 0.

Let's say, your ranked list contains 10,000 genes and your gene set has 200 gene, and you go up by a each time you encounter a gene in the set, then down by b in all the other steps. For the sum to be zero once you have gone through the list, you need to have

a * 200 - b * 9,800 = 0,

i.e., you could use a=1 and b=200/9800. This is, if I recall correctly, what GSEA does.

Also, if I recall correctly (but please double-check; it's late here and I might be wrong): The maximum absolute value that you encounter while going through the lsi, the statistic they use, is just the Kolmogorov-Smirnov (KS) statistic, i.e., the original Mootha et al. paper essentially described a KS test without calling it that way. The difference, however, is that they don't use the KS distribution but a sample permutation scheme to get values. This last part is because the KS test corresponds to permutations of genes, not of samples, and the latter seems more appropriate.

Edit: No need to double-check; I have just googled myself. See this paper for a more competent discussion than my blabbing:

Irizarry, Zhu, Wang, and Speed (2009): Gene set enrichment analysis made easy. Stat Methods Med Res. 2009 Dec; 18(6): 565–575. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3134237/

Hello Simon,

Thanks a lot for your detailed and educational explanation and of course the link to the helpful paper with the formula I was looking for. Luckily the whole significance testing and comparison of various sets is not relevant for my project at the moment, but good to know anyway.

Best regards

Matthias

1
2.3 years ago by
European Union

Why don't you simply run a t-test on the values annotated with a particular term, versus those not annnotated with a particular term? This approach is taken by e.g. Boorsma et al. 2005. Much simpler, cleaner and faster than GSEA's convoluted statistic, IMHO. You can of course use mann-withney-wilcoxon. For finding changes that cancel each other (i.e., an increase in variance), I would advocate the same approach but using the Browne-Forsythe test (see lawstat::levene.test). Be sure to do appropriate multiple testing correction.

Hello Phillip,

Thanks for your anwer. I will go with Gordons suggestion here, but I am sure your suggested tests will be of use for future problems.

Best regards

Matthias

Content
Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.