Question

Pre-processing RNA-seq data (normalization and transformation) for GSVA

6

Entering edit mode

rached ▴ 60

@rached-14655

Last seen 5.1 years ago

I'm looking for guidance on RNA-seq data pre-processing for GSVA.

The GSVA package implements several methods for computing sample-wise gene set enrichments scores:

GSVA, ssGSEA, z-score and PLAGE

From reading about these methods, it's apparent to me that GSVA, z-score, and PLAGE require library size normalization per sample, whereas ssGSEA does not. Is this correct?

On the other hand, ssGSEA requires read counts per gene to be adjusted for gene length and GC content, whereas the other methods do not. Is this correct?

GSVA standardizes gene counts by mapping them to KCDF values estimated from the data. Two different kernels are offered by GSVA for gene-wise KCDF estimation: Poisson and Guassian

To use the Gaussian kernel, count data must be log-transformed (for example log2 count per million or log2 read per kilo-base per million). For the Poisson Kernel, log transformation should not be performed but count per million may be ~~multiplied by 1x10^6~~ rounded to the nearest integer before being passed to GSVA. Is this correct?

Thank you for your assistance!

gsva gene set variation analysis normalization transformation rna-seq • 9.3k views

ADD COMMENT • link 7.7 years ago • updated 6.8 years ago rached ▴ 60

score 2 · Answer 1 · 2018-03-29

2

Entering edit mode

Robert Castelo ★ 3.4k

@rcastelo

Last seen 11 weeks ago

Barcelona/Universitat Pompeu Fabra

Hi,

regarding ssGSEA, I implemented the method in the GSVA package but I did not developed it; see the original publication by Barbie et al. (2009) for the methodological details. My understanding is that it was designed for microarray data (see documentation of its official implementation in GenePattern here) and I don't know of any publication where its suitability with RNA-seq data has been assessed, and therefore, I can't comment further. Maybe others in this support forum have experience in using ssGSEA with RNA-seq .

When using the Poisson kernel in the GSVA algorithm, the input matrix of expression values should be formed by integer counts.

cheers,

robert.

ADD COMMENT • link 7.7 years ago Robert Castelo ★ 3.4k

0

Entering edit mode

Thank you for the clarification!

ADD REPLY • link 7.6 years ago rached ▴ 60

0

Entering edit mode

When you say "integer counts", do you mean that they have to be the raw counts or can they be normalized counts (wouldn't be integers)?

ADD REPLY • link 7.5 years ago igor ▴ 50

2

Entering edit mode

In one hand, Poisson kernels are only suitable for integer values, on the other hand, the GSVA algorithm assumes that the distributions of expression values across samples are comparable and thus that the input data are normalized. The help page of the sample RNA-seq data used in the vignette of GSVA, accessible by doing:

library(GSVAdata)
help(commonPickrellHuang)

explains how did we process the RNA-seq data for the vignette and the GSVA article and you will see that we mentioned that we normalized the data, but the help page does not include details on exactly how.

When we developed this method back in 2010-2011 (GSVA forms part from BioC since release 2.8), we normalized the RNA-seq data either using TMM or CQN, and transformed the normalized values back to integer counts. I believe that the developers of TMM and CQN do not recommend such a practice anymore (back then in 2011, I don't recall that there was such a consensus). In our hands, with the RNA-seq data shown in the vignette and in the GSVA article, the approach gave good results regarding the agreement of GSVA scores between microarray and RNA-seq data. I believe that today the general recommendation by the RNA-seq normalization experts is to work with normalization factors or normalized continuous values, and in this latter case you would use with GSVA the Gaussian kernel.

If you want to try the Poisson kernel, you can transform the a TMM-normalized RNA-seq data back into integer counts, as follows:

library(edgeR)
example(equalizeLibSizes) ## sample integer count data in a matrix called 'counts'
dim(counts)
[1] 1000    2
colSums(counts)
Sample1 Sample2
  10086   19838

dge <- DGEList(counts)
dge <- calcNormFactors(dge)
dge <- estimateDisp(dge)
eqlcounts <- equalizeLibSizes(dge)
normcounts <- ceiling(eqlcounts$pseudo.counts-0.5)
colSums(normcounts)
Sample1 Sample2
  14160   14121
library(GSVA)
gsva(normcounts, list(GS1=1:10, GS2=20:30), kcdf="Poisson")
        Sample1     Sample2
GS1  0.09383583 -0.11740335
GS2 -0.09488408  0.09572034

experts in this forum about normalization may have a different advice on how to do this step.

cheers,

robert.

ADD REPLY • link 7.5 years ago Robert Castelo ★ 3.4k

0

Entering edit mode

Thank you for clarifying.

Would I also need to length-normalize (convert to TPMs/FPKMs) if I use the Gaussian kernel? Or would normalized counts be appropriate?

ADD REPLY • link 7.5 years ago igor ▴ 50

1

Entering edit mode

I think that for a gene-set level analysis you'll be fine with TMM-normalization and logCPM units per gene, which should be used with the Gaussian kernel. On the other hand, in a two recent publications (see Soneson et al., 2015, and Yi et al., 2018) it has been argued that calculating transcript-level estimates of expression, such as TPM units, and using them to obtain gene-level estimates, improves gene-level analysis. From that perspective, maybe one could expect that gene-set level analyses improve as well but as far as I know, this has not been explored. Others in this forum may have more informed opinions about this point.

ADD REPLY • link 7.5 years ago Robert Castelo ★ 3.4k

0

Entering edit mode

I tried running the example you posted. The result is close to yours:

> gsva(normcounts, list(GS1=1:10, GS2=20:30), kcdf="Poisson")
         Sample1    Sample2
GS1 -0.148823642 0.19004598
GS2  0.002574801 0.04970183

I also tried converting to logCPM values and using the Gaussian kernel:

> gsva(cpm(dge, log = TRUE), list(GS1=1:10, GS2=20:30), kcdf="Gaussian")
      Sample1   Sample2
GS1 0.8953349 0.9822134
GS2 0.9471186 0.9289494

The enrichment scores are now much higher. Is that expected?

ADD REPLY • link 7.5 years ago igor ▴ 50

1

Entering edit mode

the code and results above are based on the example of the help page of 'equalizeLibSizes()' from the edgeR package. If you read carefully that example you will see that (1) count values are sampled randomly from two different Poisson distributions, and therefore two different runs will give slightly different results; (2) the generated count data matrix has 2 samples (columns) and this is not enough for GSVA to estimate correctly the gene expression level statistics (see step 1 in figure 1 from Hanzelmann et al., 2013). Modifying the example from the 'equalizeLibSizes()' help page you can get an exactly reproducible example with sufficient sample size as follows:

library(edgeR)
library(GSVA)

nlibs <- 30
counts <- matrix(0,ngenes,nlibs)
set.seed(123) ## set the random seed to facilitate reproducing the numbers
counts[, 1:15] <- rpois(ngenes*15, lambda=10)
counts[, 16:30] <- rpois(ngenes*15, lambda=20)
dge <- DGEList(counts)
dge <- calcNormFactors(dge)
dge <- estimateDisp(dge)
eqlcounts <- equalizeLibSizes(dge)
normcounts <- ceiling(eqlcounts$pseudo.counts-0.5)
gsvacnt <- gsva(normcounts, list(GS1=1:10, GS2=20:30), kcdf="Poisson")
gsvalogcpm <- gsva(cpm(dge, log = TRUE), list(GS1=1:10, GS2=20:30), kcdf="Gaussian")
cor(gsvacnt["GS1", ], gsvalogcpm["GS1", ])
[1] 0.9799232
cor(gsvacnt["GS2", ], gsvalogcpm["GS2", ])
[1] 0.991606

as you can see, there is a nearly perfect agreement between GSVA scores derived from normalized counts and those derived from logCPM values, calculated from the same data set. This level of agreement may decrease with lower sample size and may also depend on other distributional features from data, since this example is restricted to synthetic Poisson data.

ADD REPLY • link 7.5 years ago Robert Castelo ★ 3.4k

0

Entering edit mode

Thank you for the explanation. I should've tried more samples. Using the updated example and real samples produces reasonable results.

ADD REPLY • link 7.4 years ago igor ▴ 50

0

Entering edit mode

There is a recent related thread on the GenePattern forum:

You can run ssGSEA using RNA-Seq data. This would generally work if your samples raw counts are not too different from one another. If counts vary too much, you may consider normalizing your data (e.g., so that they are between 0 and 1). Additionally, I would advise you to run your raw RNASeq counts through the module PreprocessReadCounts first. This module was designed to transform RNA-Seq data into the type of distribution that modules such as ssGSEA would expect.

The PreprocessReadCounts module returns logCPM values, so I guess that would be the most appropriate input for ssGSEA.

ADD REPLY • link 7.5 years ago igor ▴ 50