Search
Question: GSEA-like enrichment score calculation
1
gravatar for Matthias Z.
17 months ago by
Matthias Z.20
University Medical Center of Münster, Germany
Matthias Z.20 wrote:

Hello everybody,

I would like to calculate and plot the enrichment score of a selection from a ranked list, as many readers are familar with the gene set enrichment plots produced by the Broad Institute's GSEA implementation.

While the basic principle is clear to me, I failed to understand how the magnitude is determined, by which the running sum statistic is increased or decreased. The paper and the FAQ only say "the magnitude of the increment depends on the correlation of the gene with the phenotype" and a otherwise very helpful third-party site also doesn't explain how it is calculated.

Please find a working example code to illustrate my current approach below in case you would like to try it out for yourself. However to help me an explanation or formula to calculate the enrichment score would already suffice, since my current result doesn't look similar to the enrichment score of GSEA plots at all. Thanks a lot for your help.

Matthias


testvalues <- rweibull(1e4,shape=1,scale=2) + rnorm(1e4)
testdata <- data.frame(
  rank = rank(testvalues,ties.method = "random"),
  value = testvalues,
  random_set = sample(
    c(TRUE,FALSE),size = 1e4,replace = TRUE,prob = c(0.02,0.98)
  ),
  enriched_set = (  # place random TRUES, but only in the upper 50% range
    rank(testvalues,ties.method = "random") > 5e3 &
      sample(
        c(TRUE,FALSE),size = 1e4,replace = TRUE,prob = c(0.05,0.95)
      )
  )
)

# order dataset and reverse ranks (so biggest value is ranked first)
testdata <- testdata[order(testdata[,"rank"]),]
testdata[,"rank"] <- rev(testdata[,"rank"])
testdata <- rev(testdata)

# fruitless attempt to calculate someting like an enrichment score
testdata[,c("ES_random_set","ES_enriched_set")] <-
  apply(testdata[,c("random_set","enriched_set")],2,function(x) {
    enscore <- as.numeric(x)
    # normalize magnitutde for selection size
    enscore[enscore == 0] <- (-1 / (length(x) / sum(x) - 1))
    return(round(cumsum(enscore),3)*-1)
  })

### plot the wannabe enrichment score

plot_es <- function(plotdata,set,es){
  library("ggplot2")
  library("gridExtra")
 
  curve <- ggplot(plotdata,aes_string(x="rank", y="value")) +
    geom_line() + geom_point(data=plotdata[plotdata[,set],],color="red",size=5,shape=124)
  esplot <- ggplot(plotdata,aes_string(x="rank", y=es)) + geom_line()
  return(grid.arrange(curve,esplot,nrow=2,heights=c(3,1)))
}

plot_es(testdata,"random_set","ES_random_set")
plot_es(testdata,"enriched_set","ES_enriched_set")

 

ADD COMMENTlink modified 17 months ago by Gordon Smyth32k • written 17 months ago by Matthias Z.20
4
gravatar for Gordon Smyth
17 months ago by
Gordon Smyth32k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth32k wrote:

This is GSEA-like rather than GSEA:

library(limma)
barcodeplot(testdata$value, testdata$random_set)
barcodeplot(testdata$value, testdata$enriched_set)

Barcode plots of random and enriched sets.

ADD COMMENTlink modified 17 months ago • written 17 months ago by Gordon Smyth32k

Hello Gordon,

Thanks a lot for your answer, this indeed looks tremendously helpful. It seems I was trying to reinvent the wheel here.

I will probably be able to use the barcodeplot right away for my project or can at least dissect the function. Anyway this should clearly solve my problems.

Best regards

Matthias

ADD REPLYlink written 17 months ago by Matthias Z.20
3
gravatar for Simon Anders
17 months ago by
Simon Anders3.4k
Zentrum für Molekularbiologie, Universität Heidelberg
Simon Anders3.4k wrote:

My understanding is that the increments and decrements are chosen such that you end up where you started, at 0.

Let's say, your ranked list contains 10,000 genes and your gene set has 200 gene, and you go up by a each time you encounter a gene in the set, then down by b in all the other steps. For the sum to be zero once you have gone through the list, you need to have 

   a * 200 - b * 9,800 = 0,

i.e., you could use a=1 and b=200/9800. This is, if I recall correctly, what GSEA does.

Also, if I recall correctly (but please double-check; it's late here and I might be wrong): The maximum absolute value that you encounter while going through the lsi, the statistic they use, is just the Kolmogorov-Smirnov (KS) statistic, i.e., the original Mootha et al. paper essentially described a KS test without calling it that way. The difference, however, is that they don't use the KS distribution but a sample permutation scheme to get values. This last part is because the KS test corresponds to permutations of genes, not of samples, and the latter seems more appropriate.

Edit: No need to double-check; I have just googled myself. See this paper for a more competent discussion than my blabbing:

Irizarry, Zhu, Wang, and Speed (2009): Gene set enrichment analysis made easy. Stat Methods Med Res. 2009 Dec; 18(6): 565–575. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3134237/

ADD COMMENTlink modified 17 months ago • written 17 months ago by Simon Anders3.4k

Hello Simon,

Thanks a lot for your detailed and educational explanation and of course the link to the helpful paper with the formula I was looking for. Luckily the whole significance testing and comparison of various sets is not relevant for my project at the moment, but good to know anyway.

Best regards

Matthias

ADD REPLYlink written 17 months ago by Matthias Z.20
1
gravatar for Philip Lijnzaad
17 months ago by
European Union
Philip Lijnzaad160 wrote:

Why don't you simply run a t-test on the values annotated with a particular term, versus those not annnotated with a particular term? This approach is taken by e.g. Boorsma et al. 2005. Much simpler, cleaner and faster than GSEA's convoluted statistic, IMHO. You can of course use mann-withney-wilcoxon. For finding changes that cancel each other (i.e., an increase in variance), I would advocate the same approach but using the Browne-Forsythe test (see lawstat::levene.test). Be sure to do appropriate multiple testing correction.

ADD COMMENTlink written 17 months ago by Philip Lijnzaad160

Hello Phillip,

Thanks for your anwer. I will go with Gordons suggestion here, but I am sure your suggested tests will be of use for future problems.

Best regards

Matthias

ADD REPLYlink written 17 months ago by Matthias Z.20
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 286 users visited in the last hour