Question: GSVA indifferent to 'abs.ranking'
4
sina.nassiri90 wrote:

Dear GSVA community,

I was playing with the package and noticed that my results are indifferent to abs.ranking argument in the gsva() function. After carefully examining all related documents (original publication by Hanzelmann et al., package vignette, as well as source code), it appears to me that gsva() function in its current configuration (v1.24.0 on May 4, 2017) ALWAYS indicates pathways with genes enriched on either extreme (high or low) as "activated", regardless of whether abs.ranking is set to TRUE or FALSE. Before explaining why I think that is the case, let me provide some necessary background info (codes are taken directly from GSVA source file). Briefly, GSVA performs sample-wise gene set analysis in 3 steps:

STEP 1- Starting with a p(gene) x n(sample) matrix of expression values (let's assume we're dealing with microarray intensity values here), the expression profile of every gene is first brought to a common scale between 0 and 1 using an empirical cumulative distribution function (ECDF):

if (kernel) {…}
# let's assume kernel=FALSE so that ECDF is directly estimated from observed data
else { gene.density <- t(apply(expr, 1, function(x, sample.idxs) {
f <- ecdf(x[sample.idxs])
f(x)
}, sample.idxs))
gene.density <- log(gene.density / (1-gene.density))
}
return (gene.density)
# the authors refer to scaled expression values as z_ij
}

STEP 2- z_ij are converted to ranks for each sample and further normalized to be symmetric around zero. This is the step where abs.ranking comes into play.

rank.scores <- rep(0, num_genes)

if(abs.ranking){
sort.sgn.idxs <- apply(abs(gene.density), 2, order, decreasing=TRUE)
# n.genes * n.samples
}
else {
sort.sgn.idxs <- apply(gene.density, 2, order, decreasing=TRUE)
# n.genes * n.samples
}
# the authors refer to initial rank scores as z_(i)j

compute_rank_score <- function(sort_idx_vec){
tmp <- rep(0, num_genes)
tmp[sort_idx_vec] <- abs(seq(from=num_genes,to=1) - num_genes/2)
return (tmp)
}

rank.scores <- apply(sort.sgn.idxs, 2, compute_rank_score)
# the authors refer to final rank scores as r_ij

STEP 3- Finally, rank scores (r_ij) are converted to Kolmogorov-Smirnov like random walk statistic, and enrichment scores (ES ) are calculated using one of the two approaches provided (equations 4 and 5  in the original publication). Basically, if mx.diff=FALSE, ES are calculated as the maximum distance of the random walk from 0, and if mx.diff=TRUE(default), ES are calculated as the magnitude difference between the largest positive and negative random walk deviations.

Here's the problem: Assume running gsva() with mx.diff=TRUE(default), using a heterogeneous gene set with genes enriched on both extremes. Logit transformation in step 1 yields negative and positive z_ij. According to authors, taking the absolute value of z_ij (abs.ranking=TRUE) will ensure that gene sets with genes enriched on either extreme (high or low) are regarded as “activated". This is similar to using the absolute value of test-statistic in the popular GSEA (Subramanian et al., PNAS 2005) as proposed by Saxena et al., Nucleic Acid Research 2006. On the other hand, if genes are ranked according to their sign (abs.ranking=FALSE), the largest positive and negative random walk deviations will cancel each other out (see attached images for a hypothetical sample with 10 genes), emphasizing gene sets with genes concordantly activated in one direction only. Problem is, in step 2 where rank scores are further normalized to be symmetric around zero, r_ij = | z_(i)j - p/2 | where p is total number of genes, use of absolute value will overturn abs.ranking=FALSE and project genes enriched on either extreme (high or low) on the same side of the spectrum. Consequently, calculation of enrichment scores appears to be indifferent with respect to 'abs.ranking' status.

example_lower_end.jpeg

example_higher_end.jpeg

example_both_ends.jpeg

Am I missing something here? Has anyone successfully explored the abs.ranking functionality before?

1

Here is a demo following the nomenclature of provided example in gsva() help document:

# simulating data
p <- 10 ## number of genes
n <- 6 ## number of samples
nGrp1 <- 3 ## number of samples in group A
nGrp2 <- n - nGrp1 ## number of samples in group B

# study design
design <- cbind.data.frame(sample=paste("s",1:n,sep=""),conditon=rep(c("A","B"),each=3))

# consider several disjoint gene sets
geneSets <- list(set1=paste("g", 1:3, sep=""),
set2=paste("g", 4:7, sep=""),
set3=paste("g", 8:10, sep=""),
set4=c(paste("g", 1:3, sep=""),paste("g", 8:10, sep="")),
set5=paste("g", c(1,3,5,6,8,10), sep=""))

# sample data from a normal distribution with mean 0 and st.dev. 1
set.seed(256666)
y <- matrix(rnorm(n*p), nrow=p, ncol=n,
dimnames=list(paste("g", 1:p, sep="") , paste("s", 1:n, sep="")))

# assume genes in set1 are expressed at higher levels in group B:
y[geneSets$set1, (nGrp1+1):n] <- y[geneSets$set1, (nGrp1+1):n] + 3
# assume genes in set3 are expressed at lower levels in group B:
y[geneSets$set3, (nGrp1+1):n] <- y[geneSets$set3, (nGrp1+1):n] - 3

# running gsva
gsva.absF <- gsva(expr=y,
gset.idx.list=geneSets,
method="gsva",
rnaseq=FALSE,
abs.ranking=FALSE,
min.sz=1,
max.sz=Inf,
mx.diff=TRUE,
kernel=TRUE ## Guassian kernel
)

gsva.absT <- gsva(expr=y,
gset.idx.list=geneSets,
method="gsva",
rnaseq=FALSE,
abs.ranking=TRUE,
min.sz=1,
max.sz=Inf,
mx.diff=TRUE,
kernel=TRUE ## Guassian kernel
)

# see for yourself that results are identical
gsva.absF <- gsva.absF$es.obs gsva.absT <- gsva.absT$es.obs

# pay attention to gene set 4
boxplot(gsva.absT[4,]~design$conditon, main="GeneSet 4, abs.ranking=T", ylab="ES") boxplot(gsva.absF[4,]~design$conditon, main="GeneSet 4, abs.ranking=F", ylab="ES")

3
Robert Castelo2.3k wrote:

Dear Sina,

thank you very much for your detailed bug report. Indeed, the calculation of GSVA scores with the non-default configuration where 'abs.ranking=TRUE' was doing the same calculations as if 'abs.ranking=FALSE', as long as the default 'kernel=TRUE' remained unchanged, otherwise ('kernel=FALSE') the calculations were correct. This was caused by a line in the C code that was commented during some testing and should have been left uncommented, and has nothing to do with the method and the math described in the paper, which is correct. This has been fixed in the release (1.24.1) and devel (1.25.1) versions of the package that will become available in the forthcoming 24/48 hours.

best regards,

robert.

Dear Dr. Castelo,

Thank you for the feedback. I upgraded to the latest version (1.24.1) and reran the example above. Results from abs.ranking=TRUE and abs.ranking=FALSE are no longer identical, but I'm still confused about the utility of abs.ranking. Let's return to above example in which set4 is a mixture of genes ranked at both extremes. Running gsva() with method="gsva", mx.diff=TRUE, and abs.ranking=FALSE returns enrichment scores close to zero for both groups (no difference between A and B) as expected. Setting abs.ranking=TRUE is supposed to identify sets with genes enriched on either extreme (high or low) as 'highly' activated, however my analysis shows otherwise. In fact, with abs.ranking=TRUE not only set4 shows no enrichment in group B, but also for some reason sets with genes enriched on only one of the extremes (e.g. set 1 or 3) lose their enrichment. Attached please find results based on simulation on a larger dataset with similar design (100 genes x 10 samples):

https://www.dropbox.com/s/f8uwwcs3ibx6fqy/revisiting%20set4.pdf?dl=0

I would appreciate it if you clarify what exactly abs.ranking is expected to achieve.

Best,
Sina