Differential expression on GSVA scores using LIMMA
Entering edit mode
rocanja ▴ 60
Last seen 15 months ago

I am using limma on GSVA scores to assess differential expression of gene sets (in microarray and RNAseq data). Since GSVA scores can be negative, I am wondering how limma calculates the fold changes between a negative and a positive GSVA score and how meaningful a fold change cutoff would be to define differential expression of gene sets (in addition to applying a p.value cutoff).

I am familiar with using limma on gene level data and how the fold changes are derived from log2 intensities in the case of microarray data. However, extrapolating this formula onto the scenario of negative and positive GSVA scores doesn't seem to provide any meaningful result, since they are not log transformed values to start with. I am wondering whether the GSVA scores require some kind of pre-processing before being passed onto limma, although in the GSVA vignette, scores appear to be passed onto limma without any pre-processing, as far as I can see? However, in the GSVA vignette, no fold change cutoff is applied to define differential expression for gene sets, while a fold change cutoff is applied on gene level data.

limma gsva • 921 views
Entering edit mode
Robert Castelo ★ 2.7k
Last seen 12 weeks ago
Barcelona/Universitat Pompeu Fabra


Negative values are not problematic with limma, in the same way that they are not when they derive from log CPM units in RNA-seq and you use the limma-trend pipeline, or with some microarray platforms (see this post about it).

Regarding the interpretation of the GSVA units of expression, these are scores derived from a random walk through a ranking (see Eqs. (4) and (5) from the GSVA article) and therefore, they cannot be interpreted beyond the gene set having its genes towards the top of the ranking (positive values), towards the bottom (negative values) or uniformly distributed (around 0). There are many ways in which you can summarize the expression of a set of genes in a sample, such as calculating their mean, z-score or first right-singular vector from SVD to name a few, and each of them has its own advantages and caveats. The interpretation of a difference in gene-set level summaries of expression as fold-changes is tricky because to say whether a gene set or a pathway is two-fold over-expressed you have first to define what are the units of expression at that gene or pathway level and, as far as I know, there is no independent biochemical assay that allows you to measure the expression of a gene set or a pathway, deriving a fold-change against which you could compare or calibrate the difference or fold-change estimated from a high-throughput profiling assay.



Entering edit mode

Thank you - that confirms my thoughts on the topic. However, even though we can't really express the difference between two GSVA scores in terms of a 'traditional' fold-change, one may still like to use some measure that describes the extend/magnitude of change between two scores (e.g. the difference a-b) to further rank or shortlist gene sets that are defined as 'differentially expressed' based on a p.value cutoff!?

Entering edit mode

Yes, note that while the magnitude of change may be tricky to interpret, the p-value has a very precise interpretation because rejecting the null hypothesis will allow you to say something about the association between the gene set and the explanatory variable for which the null hypothesis of its coefficient was rejected, after correcting for multiple testing. The ranking of by the magnitude of change among those gene sets that meet some multiple testing correction, should also be meaningful.


Login before adding your answer.

Traffic: 277 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6