Question

GSEA-like enrichment score calculation

1

Entering edit mode

Matthias Z. ▴ 20

@matthias-z-8768

Last seen 8.5 years ago

University Medical Center of Münster, G…

Hello everybody,

I would like to calculate and plot the enrichment score of a selection from a ranked list, as many readers are familar with the gene set enrichment plots produced by the Broad Institute's GSEA implementation.

While the basic principle is clear to me, I failed to understand how the magnitude is determined, by which the running sum statistic is increased or decreased. The paper and the FAQ only say "the magnitude of the increment depends on the correlation of the gene with the phenotype" and a otherwise very helpful third-party site also doesn't explain how it is calculated.

Please find a working example code to illustrate my current approach below in case you would like to try it out for yourself. However to help me an explanation or formula to calculate the enrichment score would already suffice, since my current result doesn't look similar to the enrichment score of GSEA plots at all. Thanks a lot for your help.

Matthias

testvalues <- rweibull(1e4,shape=1,scale=2) + rnorm(1e4)
testdata <- data.frame(
  rank = rank(testvalues,ties.method = "random"),
  value = testvalues,
  random_set = sample(
    c(TRUE,FALSE),size = 1e4,replace = TRUE,prob = c(0.02,0.98)
  ),
  enriched_set = (  # place random TRUES, but only in the upper 50% range
    rank(testvalues,ties.method = "random") > 5e3 &
      sample(
        c(TRUE,FALSE),size = 1e4,replace = TRUE,prob = c(0.05,0.95)
      )
  )
)

# order dataset and reverse ranks (so biggest value is ranked first)
testdata <- testdata[order(testdata[,"rank"]),]
testdata[,"rank"] <- rev(testdata[,"rank"])
testdata <- rev(testdata)

# fruitless attempt to calculate someting like an enrichment score
testdata[,c("ES_random_set","ES_enriched_set")] <-
  apply(testdata[,c("random_set","enriched_set")],2,function(x) {
    enscore <- as.numeric(x)
    # normalize magnitutde for selection size
    enscore[enscore == 0] <- (-1 / (length(x) / sum(x) - 1))
    return(round(cumsum(enscore),3)*-1)
  })

### plot the wannabe enrichment score

plot_es <- function(plotdata,set,es){
  library("ggplot2")
  library("gridExtra")
 
  curve <- ggplot(plotdata,aes_string(x="rank", y="value")) +
    geom_line() + geom_point(data=plotdata[plotdata[,set],],color="red",size=5,shape=124)
  esplot <- ggplot(plotdata,aes_string(x="rank", y=es)) + geom_line()
  return(grid.arrange(curve,esplot,nrow=2,heights=c(3,1)))
}

plot_es(testdata,"random_set","ES_random_set")
plot_es(testdata,"enriched_set","ES_enriched_set")

GSEA enrichment analysis gseabase • 7.6k views

ADD COMMENT • link updated 9.7 years ago by Gordon Smyth 53k • written 9.7 years ago by Matthias Z. ▴ 20

1

Entering edit mode

Philip Lijnzaad ▴ 160

@philip-lijnzaad-2499

Last seen 3.6 years ago

European Union

Why don't you simply run a t-test on the values annotated with a particular term, versus those not annnotated with a particular term? This approach is taken by e.g. Boorsma et al. 2005. Much simpler, cleaner and faster than GSEA's convoluted statistic, IMHO. You can of course use mann-withney-wilcoxon. For finding changes that cancel each other (i.e., an increase in variance), I would advocate the same approach but using the Browne-Forsythe test (see lawstat::levene.test). Be sure to do appropriate multiple testing correction.

ADD COMMENT • link 9.7 years ago Philip Lijnzaad ▴ 160

0

Entering edit mode

Hello Phillip,

Thanks for your anwer. I will go with Gordons suggestion here, but I am sure your suggested tests will be of use for future problems.

Best regards

Matthias

ADD REPLY • link 9.7 years ago Matthias Z. ▴ 20

score 4 · Accepted Answer · 2016-06-15

4

Entering edit mode

Gordon Smyth 53k

@gordon-smyth

Last seen 9 hours ago

WEHI, Melbourne, Australia

This is GSEA-like rather than GSEA:

library(limma)
barcodeplot(testdata$value, testdata$random_set)
barcodeplot(testdata$value, testdata$enriched_set)

Barcode plots of random and enriched sets.

ADD COMMENT • link 9.7 years ago Gordon Smyth 53k

0

Entering edit mode

Hello Gordon,

Thanks a lot for your answer, this indeed looks tremendously helpful. It seems I was trying to reinvent the wheel here.

I will probably be able to use the barcodeplot right away for my project or can at least dissect the function. Anyway this should clearly solve my problems.

Best regards

Matthias

ADD REPLY • link 9.7 years ago Matthias Z. ▴ 20

score 3 · Accepted Answer · 2016-06-15

My understanding is that the increments and decrements are chosen such that you end up where you started, at 0.

Let's say, your ranked list contains 10,000 genes and your gene set has 200 gene, and you go up by a each time you encounter a gene in the set, then down by b in all the other steps. For the sum to be zero once you have gone through the list, you need to have

a * 200 - b * 9,800 = 0,

i.e., you could use a=1 and b=200/9800. This is, if I recall correctly, what GSEA does.

Also, if I recall correctly (but please double-check; it's late here and I might be wrong): The maximum absolute value that you encounter while going through the lsi, the statistic they use, is just the Kolmogorov-Smirnov (KS) statistic, i.e., the original Mootha et al. paper essentially described a KS test without calling it that way. The difference, however, is that they don't use the KS distribution but a sample permutation scheme to get p values. This last part is because the KS test corresponds to permutations of genes, not of samples, and the latter seems more appropriate.

Edit: No need to double-check; I have just googled myself. See this paper for a more competent discussion than my blabbing:

Irizarry, Zhu, Wang, and Speed (2009): Gene set enrichment analysis made easy. Stat Methods Med Res. 2009 Dec; 18(6): 565–575. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3134237/