Question: Can GSVA enrichment scores be negative?
pbachali wrote, 2.5 years ago:

Hi Robert,

I am Prathyusha Bachali, a student at UNCC. I am trying to work with the GSVA package. I have created my GeneSetCollection object, and I am now calling the gsva function with an ExpressionSet (a gene-by-sample matrix of all-positive expression values) and my GeneSetCollection object, which is a collection of gene sets with gene symbols as identifiers. The gsva function runs fine, but I cannot interpret my results, and I am not sure how the GSVA score is calculated, because the output contains both negative and positive scores. What does a negative score represent here?

Here is my input expressionset object. There are 1819 genes with expression values.

geneSymbol   CTL1     CTL2     SLE1     SLE2
ESRRA        6.683    6.525    6.461    6.392
CAPNS1       10.591   10.047   11.52    11.86

After running gsva on my ExpressionSet against the GeneSetCollection object, I got the following output, which includes a few negative scores. I believe these are GSVA enrichment scores.

geneSet        CTL1     CTL2      SLE1      SLE2
Immunescreen   0.012    -0.1264   -0.2167   -0.2767
ISG            -0.032   0.02867   0.057     -0.078

Thanks in advance. Any help is much appreciated.

 

Thanks,

Prat

 

Modified 19 months ago; written 2.5 years ago by pbachali.
Robert Castelo (Spain/Barcelona/Universitat Pompeu Fabra) wrote, 2.5 years ago:

Hi Prat,

the GSVA paper, available in open access here, contains all the technical details. In a nutshell, GSVA scores are calculated non-parametrically using a KS-like random walk statistic. A negative value for a particular sample and gene set means that the gene set has lower expression than the same gene set with a positive value in a different sample, or than a different gene set with a positive value. Whether you should expect positive or negative values for a particular gene set depends on the expression levels of the genes that form that gene set relative to the expression levels of the genes outside it.

Think of a ranking of genes by decreasing expression level in a particular sample, so that the top of the ranking contains the genes with higher expression levels and the bottom contains the genes with lower expression levels. Now imagine a gene set whose genes are mostly located at the bottom of that ranking. That gene set is likely to get a negative KS-like random walk statistic (what we simply call a "GSVA score").

Figure 1 of the paper, referenced below, sketches the GSVA algorithm. Step three shows a toy KS-like random walk for a gene set with 3 genes whose expression values are at the top of the ranking, which leads to a positive GSVA score. If you imagine the gene set at the bottom of the ranking instead, the two random walk step CDFs (inside and outside the gene set) would be inverted and the GSVA score would be negative.

Figure 1. Hänzelmann et al. BMC Bioinformatics, 2013
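
To make the ranking picture concrete, here is a minimal Python sketch of an unweighted KS-like random walk for a single sample. It is a simplification for illustration only, not the GSVA implementation: it ranks raw expression values directly and leaves out the other steps described in the paper. The function name `ks_like_score` and the toy data are made up for this example.

```python
def ks_like_score(expression, gene_set):
    """Toy KS-like random walk score for one sample (illustrative only).

    expression: dict mapping gene name -> expression value in this sample
    gene_set:   set of gene names
    """
    # Rank genes by decreasing expression: top of the ranking = highest value.
    ranking = sorted(expression, key=expression.get, reverse=True)
    in_set = [gene in gene_set for gene in ranking]
    n_in = sum(in_set)
    n_out = len(ranking) - n_in
    if n_in == 0 or n_out == 0:
        raise ValueError("gene set must overlap the ranking without covering it")

    running = 0.0   # difference between the inside-set and outside-set step CDFs
    extreme = 0.0   # deviation from zero with the largest absolute value so far
    for hit in in_set:
        # Step up for a gene inside the set, step down for a gene outside it.
        running += 1.0 / n_in if hit else -1.0 / n_out
        if abs(running) > abs(extreme):
            extreme = running
    return extreme


# Hypothetical sample with 10 genes, g1 the most and g10 the least expressed.
sample = {"g1": 12.0, "g2": 11.0, "g3": 10.0, "g4": 9.0, "g5": 8.0,
          "g6": 7.0, "g7": 6.0, "g8": 5.0, "g9": 4.0, "g10": 3.0}

top_score = ks_like_score(sample, {"g1", "g2", "g3"})      # set at the top
bottom_score = ks_like_score(sample, {"g8", "g9", "g10"})  # set at the bottom
print(top_score, bottom_score)
```

In this toy walk, the set placed at the top of the ranking gets a positive score and the set placed at the bottom gets a negative one, because the walk either climbs first or drops first.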

 

cheers,

robert.

pbachali wrote, 2.4 years ago:

Thank you, Robert. I understand it clearly now.

pbachali wrote, 19 months ago:

Hi Robert,

I am running GSVA on all my datasets, with my differentially expressed genes as input and custom gene sets as my reference. I am having a hard time interpreting the negative GSVA scores. In simple words, I can see that the expression values of the overlapping genes are probably low when the GSVA score is negative. What I do not understand is the two random walk step CDFs in your reply above. If you could explain that in detail, that would be great.

I really appreciate it.

 

Prat


Hi Prat,

next time, please use the 'ADD COMMENT' link, which i'm using right now, for comments, remarks and/or follow-up questions, such as your last two messages on this page, and keep the answer slots only for new answers. this helps structure the conversation about a particular topic.

regarding how negative scores come out of the GSVA algorithm, i do not have anything else to add to what i already said in my first answer above, and have no time to lecture you on this subject. you can read the paper and you can look at the source code to find your way through the algorithm. if you still do not understand how it works, you should try to formulate questions about the specific parts that you do not understand. i'm sorry i can't be more helpful this time.

cheers,

robert.

 

written 19 months ago by Robert Castelo

Hi Robert,

Thanks very much. My apologies if I have bothered you too much about GSVA.

 

Prat

written 19 months ago by pbachali

Hi Robert,

I have been using GSVA extensively for pathway-centric analysis, to understand heterogeneous populations and the pathways active in each patient. It is quite a powerful program. I have a small question, which might be a simple one. As explained in the paper, the input for gsva is log2 expression values. Here I am using the log2-transformed values of the DE genes at FDR 0.02%. To be more confident about the results, we are limiting our input to DE genes significant at FDR 0.02%. If I do it this way, do you think I am losing the power of GSVA? Or is it good practice to use all the genes left after filtering out the low-variance genes, duplicate genes, genes without Entrez IDs, and the control probes? We are wrapping up our paper, and I would really appreciate your insight on the input I am currently using.

We are using the GSVA approach mainly to look for drug molecules targeting pathways, so I was wondering whether using the significant DE genes as my input would be a good idea. Thanks in advance.

 

written 8 months ago by pbachali

Hi Prat,

please read carefully the Bioconductor Posting Guide, which contains guidelines on best practices for using this support site. These best practices are there for the benefit of everyone, including you. In particular, if you look at the guidelines for "Composing", the first one says "Compose a new message with a new subject line; only reply to an existing post if you are elaborating on or answering a previous question". Because you're not elaborating on a previous question, what you are asking now would fit better in a new question with an appropriate, specific subject and tags. This helps build a knowledge base on the use of a package and helps people find answers to previously posted questions. I'm sure you've already benefited from this strategy, but its success depends on proper use by every one of us.

That said, the answer to your question depends on what you are doing with the GSVA scores. How are you using them once they are calculated (i.e., for exploratory/visualization purposes, or for inferential purposes such as testing of some kind)?

 

written 8 months ago by Robert Castelo

Hi Robert,

Firstly, my apologies for posting incorrectly. I thought I would follow the same thread since it is all about GSVA. I will post correctly next time. Since I have already started here, I am using "ADD REPLY" for now, but from next time I will make sure I post the question correctly.

In the bigger picture, we are using GSVA scores for drug repurposing and to understand the pathogenesis of lupus, an autoimmune disorder. We build custom gene sets (such as different cell types, genes responding to drug treatments, etc.) and then look at how these gene sets behave in our expression profiling datasets. Using the matrix of log2 expression values of DE genes significant at FDR 0.02% as input and the custom gene sets as reference, we apply the "gsva" method and see some interesting results. Now I am not sure whether I need to broaden my approach and also take the non-significant genes into consideration. I am a little confused about which approach would be better. If I add the non-significant genes, is there any chance of increasing false positives in my results?

Thank you so much again for answering my questions. I will definitely make sure next time that I post correctly. I really appreciate your time. 

Prat

written 8 months ago by pbachali

Hi Prat,

if you are using GSVA scores for inferential purposes, such as selecting gene sets that are differentially expressed, then i'd recommend starting with the whole set of genes, discarding those that are lowly expressed, and calculating GSVA scores over the collection of gene sets of your interest. Then, do your differential expression analysis over those GSVA scores. If you are using GSVA scores for exploratory/visualization purposes, then what you are doing, using only DE genes, is already fine. Since you did not answer my question before, i don't know what you mean by "increasing false positives". Regarding these messages, one should write a thread for each different question, and not for each different package. If you think some of the answers address your question, you should upvote them. This also helps guide people to useful answers.
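
The order of operations recommended for inferential use (filter lowly expressed genes, score gene sets per sample, then test the scores between groups) can be sketched in Python as follows. Everything here is an assumption made for illustration: the toy matrix, the `min_mean` threshold of 3.0, the helper names, the rank-based placeholder score (which is not a GSVA score), and the Welch t-statistic standing in for whatever differential expression method is actually applied to the scores.

```python
from statistics import mean, variance

# Toy log2 expression matrix: gene -> values for CTL1, CTL2, CTL3, SLE1, SLE2, SLE3.
expr = {
    "A":   [5.0, 6.0, 5.5, 9.0, 9.5, 8.0],
    "B":   [6.0, 5.0, 7.5, 8.5, 9.0, 9.2],
    "C":   [9.0, 9.1, 9.2, 7.0, 7.1, 8.6],
    "D":   [8.0, 8.1, 8.2, 6.0, 6.1, 6.2],
    "E":   [7.0, 7.1, 7.0, 7.5, 7.6, 7.7],
    "F":   [6.5, 5.5, 6.5, 6.5, 6.6, 6.7],
    "LOW": [1.0, 1.2, 0.9, 1.1, 1.0, 1.3],
}
groups = ["CTL", "CTL", "CTL", "SLE", "SLE", "SLE"]
gene_set = {"A", "B"}  # hypothetical custom gene set


def filter_low_expression(matrix, min_mean):
    """Step 1: keep genes whose mean expression across samples is >= min_mean."""
    return {g: vals for g, vals in matrix.items() if mean(vals) >= min_mean}


def placeholder_set_score(matrix, genes, sample_idx):
    """Step 2 stand-in (NOT GSVA): centred mean rank of the set's genes.

    Positive when the set's genes sit above the middle of the sample's
    ranking by decreasing expression, negative when they sit below it.
    """
    ranking = sorted(matrix, key=lambda g: matrix[g][sample_idx], reverse=True)
    positions = [i for i, g in enumerate(ranking) if g in genes]
    return 1.0 - 2.0 * mean(positions) / (len(ranking) - 1)


def welch_t(a, b):
    """Step 3 stand-in: Welch t-statistic comparing two groups of scores."""
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return (mean(a) - mean(b)) / se


filtered = filter_low_expression(expr, min_mean=3.0)
scores = [placeholder_set_score(filtered, gene_set, i) for i in range(len(groups))]
ctl_scores = [s for s, g in zip(scores, groups) if g == "CTL"]
sle_scores = [s for s, g in zip(scores, groups) if g == "SLE"]
t_stat = welch_t(sle_scores, ctl_scores)
print(ctl_scores, sle_scores, t_stat)
```

The point of the sketch is the ordering: the scores are computed from all genes that survive the expression filter, and the group comparison is done on the scores rather than used to pre-select the input genes.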

written 8 months ago by Robert Castelo