Dear GSVA community,
I was playing with the package and noticed that my results are indifferent to abs.ranking
argument in the gsva()
function. After carefully examining all related documents (original publication by Hanzelmann et al., package vignette, as well as source code), it appears to me that gsva()
function in its current configuration (v1.24.0 on May 4, 2017) ALWAYS indicates pathways with genes enriched on either extreme (high or low) as "activated", regardless of whether abs.ranking
is set to TRUE or FALSE. Before explaining why I think that is the case, let me provide some necessary background info (codes are taken directly from GSVA source file). Briefly, GSVA performs sample-wise gene set analysis in 3 steps:
STEP 1- Starting with a p(gene) x n(sample) matrix of expression values (let's assume we're dealing with microarray intensity values here), the expression profile of every gene is first brought to a common scale between 0 and 1 using an empirical cumulative distribution function (ECDF):
if (kernel) {…} # let's assume kernel=FALSE so that ECDF is directly estimated from observed data else { gene.density <- t(apply(expr, 1, function(x, sample.idxs) { f <- ecdf(x[sample.idxs]) f(x) }, sample.idxs)) gene.density <- log(gene.density / (1-gene.density)) } return (gene.density) # the authors refer to scaled expression values as z_ij }
STEP 2- z_ij are converted to ranks for each sample and further normalized to be symmetric around zero. This is the step where abs.ranking
comes into play.
rank.scores <- rep(0, num_genes) if(abs.ranking){ sort.sgn.idxs <- apply(abs(gene.density), 2, order, decreasing=TRUE) # n.genes * n.samples } else { sort.sgn.idxs <- apply(gene.density, 2, order, decreasing=TRUE) # n.genes * n.samples } # the authors refer to initial rank scores as z_(i)j compute_rank_score <- function(sort_idx_vec){ tmp <- rep(0, num_genes) tmp[sort_idx_vec] <- abs(seq(from=num_genes,to=1) - num_genes/2) return (tmp) } rank.scores <- apply(sort.sgn.idxs, 2, compute_rank_score) # the authors refer to final rank scores as r_ij
STEP 3- Finally, rank scores (r_ij) are converted to Kolmogorov-Smirnov like random walk statistic, and enrichment scores (ES ) are calculated using one of the two approaches provided (equations 4 and 5 in the original publication). Basically, if mx.diff=FALSE
, ES are calculated as the maximum distance of the random walk from 0, and if mx.diff=TRUE(default)
, ES are calculated as the magnitude difference between the largest positive and negative random walk deviations.
Here's the problem: Assume running gsva()
with mx.diff=TRUE(default)
, using a heterogeneous gene set with genes enriched on both extremes. Logit transformation in step 1 yields negative and positive z_ij. According to authors, taking the absolute value of z_ij (abs.ranking=TRUE) will ensure that gene sets with genes enriched on either extreme (high or low) are regarded as “activated". This is similar to using the absolute value of test-statistic in the popular GSEA (Subramanian et al., PNAS 2005) as proposed by Saxena et al., Nucleic Acid Research 2006. On the other hand, if genes are ranked according to their sign (abs.ranking=FALSE), the largest positive and negative random walk deviations will cancel each other out (see attached images for a hypothetical sample with 10 genes), emphasizing gene sets with genes concordantly activated in one direction only. Problem is, in step 2 where rank scores are further normalized to be symmetric around zero, r_ij = | z_(i)j - p/2 | where p is total number of genes, use of absolute value will overturn abs.ranking=FALSE
and project genes enriched on either extreme (high or low) on the same side of the spectrum. Consequently, calculation of enrichment scores appears to be indifferent with respect to 'abs.ranking' status.
Am I missing something here? Has anyone successfully explored the abs.ranking
functionality before?
Looking forward to your feedback!
Here is a demo following the nomenclature of provided example in
gsva()
help document: