I'm looking for guidance on RNA-seq data pre-processing for GSVA.
The GSVA package implements several methods for computing sample-wise gene set enrichments scores:
GSVA, ssGSEA, z-score and PLAGE
From reading about these methods, it's apparent to me that GSVA, z-score, and PLAGE require library size normalization per sample, whereas ssGSEA does not. Is this correct?
On the other hand, ssGSEA requires read counts per gene to be adjusted for gene length and GC content, whereas the other methods do not. Is this correct?
GSVA standardizes gene counts by mapping them to KCDF values estimated from the data. Two different kernels are offered by GSVA for gene-wise KCDF estimation: Poisson and Guassian
To use the Gaussian kernel, count data must be log-transformed (for example log2 count per million or log2 read per kilo-base per million). For the Poisson Kernel, log transformation should not be performed but count per million may be multiplied by 1x10^6 rounded to the nearest integer before being passed to GSVA. Is this correct?
Thank you for your assistance!
Thank you for the clarification!
When you say "integer counts", do you mean that they have to be the raw counts or can they be normalized counts (wouldn't be integers)?
In one hand, Poisson kernels are only suitable for integer values, on the other hand, the GSVA algorithm assumes that the distributions of expression values across samples are comparable and thus that the input data are normalized. The help page of the sample RNA-seq data used in the vignette of GSVA, accessible by doing:
explains how did we process the RNA-seq data for the vignette and the GSVA article and you will see that we mentioned that we normalized the data, but the help page does not include details on exactly how.
When we developed this method back in 2010-2011 (GSVA forms part from BioC since release 2.8), we normalized the RNA-seq data either using TMM or CQN, and transformed the normalized values back to integer counts. I believe that the developers of TMM and CQN do not recommend such a practice anymore (back then in 2011, I don't recall that there was such a consensus). In our hands, with the RNA-seq data shown in the vignette and in the GSVA article, the approach gave good results regarding the agreement of GSVA scores between microarray and RNA-seq data. I believe that today the general recommendation by the RNA-seq normalization experts is to work with normalization factors or normalized continuous values, and in this latter case you would use with GSVA the Gaussian kernel.
If you want to try the Poisson kernel, you can transform the a TMM-normalized RNA-seq data back into integer counts, as follows:
experts in this forum about normalization may have a different advice on how to do this step.
cheers,
robert.
Thank you for clarifying.
Would I also need to length-normalize (convert to TPMs/FPKMs) if I use the Gaussian kernel? Or would normalized counts be appropriate?
I think that for a gene-set level analysis you'll be fine with TMM-normalization and logCPM units per gene, which should be used with the Gaussian kernel. On the other hand, in a two recent publications (see Soneson et al., 2015, and Yi et al., 2018) it has been argued that calculating transcript-level estimates of expression, such as TPM units, and using them to obtain gene-level estimates, improves gene-level analysis. From that perspective, maybe one could expect that gene-set level analyses improve as well but as far as I know, this has not been explored. Others in this forum may have more informed opinions about this point.
I tried running the example you posted. The result is close to yours:
I also tried converting to logCPM values and using the Gaussian kernel:
The enrichment scores are now much higher. Is that expected?
the code and results above are based on the example of the help page of 'equalizeLibSizes()' from the edgeR package. If you read carefully that example you will see that (1) count values are sampled randomly from two different Poisson distributions, and therefore two different runs will give slightly different results; (2) the generated count data matrix has 2 samples (columns) and this is not enough for GSVA to estimate correctly the gene expression level statistics (see step 1 in figure 1 from Hanzelmann et al., 2013). Modifying the example from the 'equalizeLibSizes()' help page you can get an exactly reproducible example with sufficient sample size as follows:
as you can see, there is a nearly perfect agreement between GSVA scores derived from normalized counts and those derived from logCPM values, calculated from the same data set. This level of agreement may decrease with lower sample size and may also depend on other distributional features from data, since this example is restricted to synthetic Poisson data.
Thank you for the explanation. I should've tried more samples. Using the updated example and real samples produces reasonable results.
There is a recent related thread on the GenePattern forum:
You can run ssGSEA using RNA-Seq data. This would generally work if your samples raw counts are not too different from one another. If counts vary too much, you may consider normalizing your data (e.g., so that they are between 0 and 1). Additionally, I would advise you to run your raw RNASeq counts through the module PreprocessReadCounts first. This module was designed to transform RNA-Seq data into the type of distribution that modules such as ssGSEA would expect.
The PreprocessReadCounts module returns logCPM values, so I guess that would be the most appropriate input for ssGSEA.