Question: GSVA: questions about the bootstrap p-value
11 months ago by
dionnezaal10
dionnezaal10 wrote:

Hi all, I am looking into the GSVA package and would like to calculate GSVA scores. When using the gsva method from the package, you can choose to do bootstrapping, which in the end gives you a p-value for each sample and gene set combination. I am having some trouble understanding this p-value and what is actually bootstrapped. I could not find information in the GSVA tutorial or paper, but I found some information in the code of the package. My questions are the following:

- Am I correct that the samples (columns in the expression data) are bootstrapped?
- The part of the code where the p-value is calculated (see bottom of this post) states that a
non-parametric test is done to test if the median of the empirical distribution is 0. Why was 0
chosen as the null hypothesis in this case?
- How should I interpret the calculated p-value? To me it seems to be the proportion of bootstrap scores on
the extreme side of the distribution?

Hope there is someone that can help me!
Kind regards, Dionne

## Code calculating p-value from bootstrapping
# es.obs = observed gsva score
# es.bootstraps = estimated gsva scores from the bootstrap
# no.bootstraps = number of bootstraps

for(i in 1:n.gset){
  for(j in 1:n.samples){
    # non-parametric test if median of empirical dist is 0
    if(es.obs[i,j] > 0){
      p.vals.sign[i,j] <- (1 + sum(es.bootstraps[i,j,] < 0)) / (1 + no.bootstraps)
    }else{
      p.vals.sign[i,j] <- (1 + sum(es.bootstraps[i,j,] > 0)) / (1 + no.bootstraps)
    }
  }
}

11 months ago by
Robert Castelo2.1k
Spain/Barcelona/Universitat Pompeu Fabra
Robert Castelo2.1k wrote:

hi,

sorry for the delay in getting back to you. I'm a contributor to GSVA, not the maintainer of the package who added this feature, but I'll try to answer. Indeed, bootstrapping was added after the publication, which is the main reason why it is not well described. Let me warn you that this is still an experimental feature, and therefore the way it works may change in the future. Going to your specific questions:

Am I correct that the samples (columns in the expression data) are bootstrapped?

yes
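
To make concrete what resampling the samples means, here is a minimal sketch in R. It is not the package's own code: the toy matrix `expr`, its dimensions, and the seed are made up for illustration, and it assumes a bootstrap replicate is built by drawing column indices with replacement.

```r
set.seed(1)  # arbitrary seed, only to make the toy example reproducible

# hypothetical expression matrix: 5 genes (rows) x 4 samples (columns)
expr <- matrix(rnorm(20), nrow = 5,
               dimnames = list(paste0("gene", 1:5), paste0("sample", 1:4)))

# one bootstrap replicate: draw sample (column) indices with replacement
boot.idx <- sample.int(ncol(expr), size = ncol(expr), replace = TRUE)
expr.boot <- expr[, boot.idx, drop = FALSE]

# same dimensions as expr; the genes (rows) are left untouched
dim(expr.boot)
```

In such a replicate some samples can appear more than once and others not at all, while every gene is kept.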

The part of the code where the p-value is calculated (see bottom of this post) states that a non-parametric test is done to test if the median of the empirical distribution is 0. Why was 0 chosen as the null hypothesis in this case?

I'm not the one who added this feature, but my interpretation is that under the null hypothesis the two step CDFs, of genes inside the gene set and of genes outside the gene set, are identical, so the resulting K-S statistic should be zero.

How should I interpret the calculated p-value? To me it seems to be the proportion of bootstrap scores on the extreme side of the distribution?

Again, because I'm not the one who added this feature, I cannot safely answer your question, but it looks like a non-parametric sign test, and then it would be interpreted as the probability that the difference between the two step CDFs has zero median.
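
As a rough illustration of the formula in the snippet from the question, the sketch below applies it to a single gene set and sample pair. The observed score `es.obs` and the bootstrap scores `es.boot` are simulated here, so the numbers are assumptions and not GSVA output; only the counting rule is taken from the quoted code.

```r
set.seed(42)  # arbitrary seed for the toy example

no.bootstraps <- 999
es.obs <- 0.3  # hypothetical observed score for one gene set / sample pair
# made-up bootstrap scores, standing in for es.bootstraps[i,j,]
es.boot <- rnorm(no.bootstraps, mean = 0.3, sd = 0.2)

# count the bootstrap scores whose sign is opposite to the observed score
n.opposite <- if (es.obs > 0) sum(es.boot < 0) else sum(es.boot > 0)

# adding 1 to numerator and denominator keeps the p-value above zero
p.val <- (1 + n.opposite) / (1 + no.bootstraps)
p.val
```

Read this way, a small value means that few bootstrap scores fell on the other side of zero from the observed score, and the smallest value the formula can return is 1 / (1 + no.bootstraps).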

cheers,

robert.