GSEA-like enrichment score calculation
3
1
Entering edit mode
Matthias Z. ▴ 20
@matthias-z-8768
Last seen 6.7 years ago
University Medical Center of Münster, G…

Hello everybody,

I would like to calculate and plot the enrichment score of a selection from a ranked list, as many readers are familar with the gene set enrichment plots produced by the Broad Institute's GSEA implementation.

While the basic principle is clear to me, I failed to understand how the magnitude is determined, by which the running sum statistic is increased or decreased. The paper and the FAQ only say "the magnitude of the increment depends on the correlation of the gene with the phenotype" and a otherwise very helpful third-party site also doesn't explain how it is calculated.

Please find a working example code to illustrate my current approach below in case you would like to try it out for yourself. However to help me an explanation or formula to calculate the enrichment score would already suffice, since my current result doesn't look similar to the enrichment score of GSEA plots at all. Thanks a lot for your help.

Matthias


testvalues <- rweibull(1e4,shape=1,scale=2) + rnorm(1e4)
testdata <- data.frame(
  rank = rank(testvalues,ties.method = "random"),
  value = testvalues,
  random_set = sample(
    c(TRUE,FALSE),size = 1e4,replace = TRUE,prob = c(0.02,0.98)
  ),
  enriched_set = (  # place random TRUES, but only in the upper 50% range
    rank(testvalues,ties.method = "random") > 5e3 &
      sample(
        c(TRUE,FALSE),size = 1e4,replace = TRUE,prob = c(0.05,0.95)
      )
  )
)

# order dataset and reverse ranks (so biggest value is ranked first)
testdata <- testdata[order(testdata[,"rank"]),]
testdata[,"rank"] <- rev(testdata[,"rank"])
testdata <- rev(testdata)

# fruitless attempt to calculate someting like an enrichment score
testdata[,c("ES_random_set","ES_enriched_set")] <-
  apply(testdata[,c("random_set","enriched_set")],2,function(x) {
    enscore <- as.numeric(x)
    # normalize magnitutde for selection size
    enscore[enscore == 0] <- (-1 / (length(x) / sum(x) - 1))
    return(round(cumsum(enscore),3)*-1)
  })

### plot the wannabe enrichment score

plot_es <- function(plotdata,set,es){
  library("ggplot2")
  library("gridExtra")
 
  curve <- ggplot(plotdata,aes_string(x="rank", y="value")) +
    geom_line() + geom_point(data=plotdata[plotdata[,set],],color="red",size=5,shape=124)
  esplot <- ggplot(plotdata,aes_string(x="rank", y=es)) + geom_line()
  return(grid.arrange(curve,esplot,nrow=2,heights=c(3,1)))
}

plot_es(testdata,"random_set","ES_random_set")
plot_es(testdata,"enriched_set","ES_enriched_set")

 

GSEA enrichment analysis gseabase • 5.5k views
ADD COMMENT
4
Entering edit mode
@gordon-smyth
Last seen 1 hour ago
WEHI, Melbourne, Australia

This is GSEA-like rather than GSEA:

library(limma)
barcodeplot(testdata$value, testdata$random_set)
barcodeplot(testdata$value, testdata$enriched_set)

Barcode plots of random and enriched sets.

ADD COMMENT
0
Entering edit mode

Hello Gordon,

Thanks a lot for your answer, this indeed looks tremendously helpful. It seems I was trying to reinvent the wheel here.

I will probably be able to use the barcodeplot right away for my project or can at least dissect the function. Anyway this should clearly solve my problems.

Best regards

Matthias

ADD REPLY
3
Entering edit mode
Simon Anders ★ 3.7k
@simon-anders-3855
Last seen 3.7 years ago
Zentrum für Molekularbiologie, Universi…

My understanding is that the increments and decrements are chosen such that you end up where you started, at 0.

Let's say, your ranked list contains 10,000 genes and your gene set has 200 gene, and you go up by a each time you encounter a gene in the set, then down by b in all the other steps. For the sum to be zero once you have gone through the list, you need to have 

   a * 200 - b * 9,800 = 0,

i.e., you could use a=1 and b=200/9800. This is, if I recall correctly, what GSEA does.

Also, if I recall correctly (but please double-check; it's late here and I might be wrong): The maximum absolute value that you encounter while going through the lsi, the statistic they use, is just the Kolmogorov-Smirnov (KS) statistic, i.e., the original Mootha et al. paper essentially described a KS test without calling it that way. The difference, however, is that they don't use the KS distribution but a sample permutation scheme to get values. This last part is because the KS test corresponds to permutations of genes, not of samples, and the latter seems more appropriate.

Edit: No need to double-check; I have just googled myself. See this paper for a more competent discussion than my blabbing:

Irizarry, Zhu, Wang, and Speed (2009): Gene set enrichment analysis made easy. Stat Methods Med Res. 2009 Dec; 18(6): 565–575. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3134237/

ADD COMMENT
0
Entering edit mode

Hello Simon,

Thanks a lot for your detailed and educational explanation and of course the link to the helpful paper with the formula I was looking for. Luckily the whole significance testing and comparison of various sets is not relevant for my project at the moment, but good to know anyway.

Best regards

Matthias

ADD REPLY
1
Entering edit mode
@philip-lijnzaad-2499
Last seen 21 months ago
European Union

Why don't you simply run a t-test on the values annotated with a particular term, versus those not annnotated with a particular term? This approach is taken by e.g. Boorsma et al. 2005. Much simpler, cleaner and faster than GSEA's convoluted statistic, IMHO. You can of course use mann-withney-wilcoxon. For finding changes that cancel each other (i.e., an increase in variance), I would advocate the same approach but using the Browne-Forsythe test (see lawstat::levene.test). Be sure to do appropriate multiple testing correction.

ADD COMMENT
0
Entering edit mode

Hello Phillip,

Thanks for your anwer. I will go with Gordons suggestion here, but I am sure your suggested tests will be of use for future problems.

Best regards

Matthias

ADD REPLY

Login before adding your answer.

Traffic: 1027 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6