Question: Pre-processing RNA-seq data (normalization and transformation) for GSVA
4
12 months ago by
rached40
rached40 wrote:

I'm looking for guidance on RNA-seq data pre-processing for GSVA.

The GSVA package implements several methods for computing sample-wise gene set enrichments scores:

GSVA, ssGSEA, z-score and PLAGE

From reading about these methods, it's apparent to me that GSVA, z-score, and PLAGE require library size normalization per sample, whereas ssGSEA does not. Is this correct?

On the other hand, ssGSEA requires read counts per gene to be adjusted for gene length and GC content, whereas the other methods do not. Is this correct?

GSVA standardizes gene counts by mapping them to KCDF values estimated from the data. Two different kernels are offered by GSVA for gene-wise KCDF estimation: Poisson and Guassian

To use the Gaussian kernel, count data must be log-transformed (for example log2 count per million or log2 read per kilo-base per million). For the Poisson Kernel, log transformation should not be performed but count per million may be multiplied by 1x10^6 rounded to the nearest integer before being passed to GSVA. Is this correct?

modified 12 weeks ago • written 12 months ago by rached40
Answer: Pre-processing RNA-seq data (normalization and transformation) for GSVA
2
12 months ago by
Robert Castelo2.3k
Spain/Barcelona/Universitat Pompeu Fabra
Robert Castelo2.3k wrote:

Hi,

regarding ssGSEA, I implemented the method in the GSVA package but I did not developed it; see the original publication by Barbie et al. (2009) for the methodological details. My understanding is that it was designed for microarray data (see documentation of its official implementation in GenePattern here) and I don't know of any publication where its suitability with RNA-seq data has been assessed, and therefore, I can't comment further. Maybe others in this support forum have experience in using ssGSEA with RNA-seq .

When using the Poisson kernel in the GSVA algorithm, the input matrix of expression values should be formed by integer counts.

cheers,

robert.

Thank you for the clarification!

When you say "integer counts", do you mean that they have to be the raw counts or can they be normalized counts (wouldn't be integers)?

1

In one hand, Poisson kernels are only suitable for integer values, on the other hand, the GSVA algorithm assumes that the distributions of expression values across samples are comparable and thus that the input data are normalized. The help page of the  sample RNA-seq data used in the vignette of GSVA, accessible by doing:

library(GSVAdata)
help(commonPickrellHuang)

explains how did we process the RNA-seq data for the vignette and the GSVA article and you will see that we mentioned that we normalized the data, but the help page does not include details on exactly how.

When we developed this method back in 2010-2011 (GSVA forms part from BioC since release 2.8), we normalized the RNA-seq data either using TMM or CQN, and transformed the normalized values back to integer counts. I believe that the developers of TMM and CQN do not recommend such a practice anymore (back then in 2011, I don't recall that there was such a consensus). In our hands, with the RNA-seq data shown in the vignette and in the GSVA article, the approach gave good results regarding the agreement of GSVA scores between microarray and RNA-seq data. I believe that today the general recommendation by the RNA-seq normalization experts is to work with normalization factors or normalized continuous values, and in this latter case you would use with GSVA the Gaussian kernel.

If you want to try the Poisson kernel, you can transform the a TMM-normalized RNA-seq data back into integer counts, as follows:

library(edgeR)
example(equalizeLibSizes) ## sample integer count data in a matrix called 'counts'
dim(counts)
[1] 1000    2
colSums(counts)
Sample1 Sample2
10086   19838

dge <- DGEList(counts)
dge <- calcNormFactors(dge)
dge <- estimateDisp(dge)
eqlcounts <- equalizeLibSizes(dge)
normcounts <- ceiling(eqlcounts$pseudo.counts-0.5) colSums(normcounts) Sample1 Sample2 14160 14121 library(GSVA) gsva(normcounts, list(GS1=1:10, GS2=20:30), kcdf="Poisson") Sample1 Sample2 GS1 0.09383583 -0.11740335 GS2 -0.09488408 0.09572034 experts in this forum about normalization may have a different advice on how to do this step. cheers, robert. ADD REPLYlink modified 10 months ago • written 10 months ago by Robert Castelo2.3k Thank you for clarifying. Would I also need to length-normalize (convert to TPMs/FPKMs) if I use the Gaussian kernel? Or would normalized counts be appropriate? ADD REPLYlink written 10 months ago by igor20 1 I think that for a gene-set level analysis you'll be fine with TMM-normalization and logCPM units per gene, which should be used with the Gaussian kernel. On the other hand, in a two recent publications (see Soneson et al., 2015, and Yi et al., 2018) it has been argued that calculating transcript-level estimates of expression, such as TPM units, and using them to obtain gene-level estimates, improves gene-level analysis. From that perspective, maybe one could expect that gene-set level analyses improve as well but as far as I know, this has not been explored. Others in this forum may have more informed opinions about this point. ADD REPLYlink written 10 months ago by Robert Castelo2.3k I tried running the example you posted. The result is close to yours: > gsva(normcounts, list(GS1=1:10, GS2=20:30), kcdf="Poisson") Sample1 Sample2 GS1 -0.148823642 0.19004598 GS2 0.002574801 0.04970183 I also tried converting to logCPM values and using the Gaussian kernel: > gsva(cpm(dge, log = TRUE), list(GS1=1:10, GS2=20:30), kcdf="Gaussian") Sample1 Sample2 GS1 0.8953349 0.9822134 GS2 0.9471186 0.9289494 The enrichment scores are now much higher. Is that expected? ADD REPLYlink written 10 months ago by igor20 1 the code and results above are based on the example of the help page of 'equalizeLibSizes()' from the edgeR package. If you read carefully that example you will see that (1) count values are sampled randomly from two different Poisson distributions, and therefore two different runs will give slightly different results; (2) the generated count data matrix has 2 samples (columns) and this is not enough for GSVA to estimate correctly the gene expression level statistics (see step 1 in figure 1 from Hanzelmann et al., 2013). Modifying the example from the 'equalizeLibSizes()' help page you can get an exactly reproducible example with sufficient sample size as follows: library(edgeR) library(GSVA) nlibs <- 30 counts <- matrix(0,ngenes,nlibs) set.seed(123) ## set the random seed to facilitate reproducing the numbers counts[, 1:15] <- rpois(ngenes*15, lambda=10) counts[, 16:30] <- rpois(ngenes*15, lambda=20) dge <- DGEList(counts) dge <- calcNormFactors(dge) dge <- estimateDisp(dge) eqlcounts <- equalizeLibSizes(dge) normcounts <- ceiling(eqlcounts$pseudo.counts-0.5)
gsvacnt <- gsva(normcounts, list(GS1=1:10, GS2=20:30), kcdf="Poisson")
gsvalogcpm <- gsva(cpm(dge, log = TRUE), list(GS1=1:10, GS2=20:30), kcdf="Gaussian")
cor(gsvacnt["GS1", ], gsvalogcpm["GS1", ])
[1] 0.9799232
cor(gsvacnt["GS2", ], gsvalogcpm["GS2", ])
[1] 0.991606

as you can see, there is a nearly perfect agreement between GSVA scores derived from normalized counts and those derived from logCPM values, calculated from the same data set. This level of agreement may decrease with lower sample size and may also depend on other distributional features from data, since this example is restricted to synthetic Poisson data.

Thank you for the explanation. I should've tried more samples. Using the updated example and real samples produces reasonable results.

There is a recent related thread on the GenePattern forum:

You can run ssGSEA using RNA-Seq data. This would generally work if your samples raw counts are not too different from one another. If counts vary too much, you may consider normalizing your data (e.g., so that they are between 0 and 1). Additionally, I would advise you to run your raw RNASeq counts through the module PreprocessReadCounts first. This module was designed to transform RNA-Seq data into the type of distribution that modules such as ssGSEA would expect.

The PreprocessReadCounts module returns logCPM values, so I guess that would be the most appropriate input for ssGSEA.

Content
Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.