Set rnaseq logical flag during GSVA
@leiendeckerlu-13702
Last seen 4.1 years ago

Hi there,

I'm right now stuck with the following problem:

I have a matrix with TPM values called 'expr' and a list of gene sets 'list'. The TPM matrix is not log-transformed but rather directly derived from the salmon output via tximport.

This is my code:

TPM <- tximport(files, type = "salmon", tx2gene = tx2gene, abundanceCol = "TPM")
TPMMatrix <- as.data.frame(TPM$abundance)
# some reformatting but no data manipulation ...
expr <- as.matrix(TPMMatrix)
#expr <- log10(expr)
res <- gsva(expr, list, rnaseq=FALSE)$es.obs

pheatmap(res, main = "GSVA")

I'm now successfully running GSVA. However, I'm not sure whether I understand the rnaseq flag correctly. The documentation says:

Logical flag set by default to rnaseq=FALSE to inform whether the input gene expression data are continuous values, such as fluorescent units in logarithmic scale from microarray experiments or some other kind of continuous value derived from RNA-seq counts such as log-CPMs, log-RPKMs or log-TPMs. This flag should be set to rnaseq=TRUE only when the values of the input gene expression data are integer counts.

In addition, I found the following explanation:

The name of the argument rnaseq can be misleading. When set to rnaseq=FALSE, the nonparametric estimation of the cumulative density function of the expression profile of each gene across samples is done using Gaussian kernels suited for continuous values. These were initially thought to be only microarray fluorescent units in logarithmic scale but nowadays these may also correspond to continuous values derived from RNA-seq experiments such as log-CPMs, log-RPKMs or log-TPMs. When rnaseq=TRUE, Poisson kernels are used instead and therefore, this option is only suitable when the input gene expression matrix is formed by integer counts. This implies that rnaseq=FALSE may also be used even when the expression data comes from a RNA-seq experiment. The name of this argument may change in the future to avoid this confusion.

Am I right that rnaseq=TRUE should only be set if the TPM values are log-transformed, and that for plain TPM values, as I use them, it is not necessary? I'm confused now...

I tried both, and the results of course differ substantially.

Thank you!

gsva rnaseq tpm • 838 views
Robert Castelo ★ 2.7k
@rcastelo
Last seen 12 weeks ago
Barcelona/Universitat Pompeu Fabra

Hi, sorry for the very long delay in answering your question. I see that the documentation may be confusing about whether the logarithmic scale is required, but its crucial point is that rnaseq=TRUE "is only suitable when the input gene expression matrix is formed by integer counts"; otherwise you should set rnaseq=FALSE. Because CPMs, RPKMs and TPMs are not integer counts, you should set rnaseq=FALSE with these quantities, whether or not they are log-transformed.
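As a rough illustration of that rule, here is a minimal base-R sketch. The helper name looks_like_counts and the toy matrices are made up for this example and are not part of GSVA; the check only tests whether every value is a non-negative whole number, which is a heuristic and not a guarantee that the data are raw counts.

```r
# Hypothetical helper (not part of GSVA): returns TRUE only when every value
# is a non-negative whole number, i.e. the matrix looks like integer counts.
looks_like_counts <- function(expr) {
  is.numeric(expr) && !anyNA(expr) &&
    all(expr >= 0) && all(expr == round(expr))
}

# Toy data, invented for illustration
counts <- matrix(c(0, 12, 7, 103), nrow = 2)          # integer counts
tpm    <- matrix(c(0.0, 12.4, 7.31, 103.9), nrow = 2) # TPM-like values

looks_like_counts(counts)  # TRUE
looks_like_counts(tpm)     # FALSE

# With the objects from the question, the call could then read
# (commented out because it needs GSVA and the real data):
# res <- gsva(expr, list, rnaseq = looks_like_counts(expr))$es.obs
```

For a TPM matrix such as the one from tximport, this returns FALSE, which matches setting rnaseq=FALSE.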

The option 'rnaseq' will be deprecated in the forthcoming release of October 2017 and replaced by a new option called 'kcdf'. It will allow the user to specify the type of kernel employed during the nonparametric estimation of the cumulative density function of the expression profile, which is the step that needs to be informed about whether the input expression data are integer counts or not. In the forthcoming version of GSVA the default will be 'kcdf="Gaussian"', which means the input expression values are continuous. The other possible values will be 'kcdf="Poisson"' for integer counts and 'kcdf="none"' to skip this step for experimental purposes.
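To make the difference between the two kernels more concrete, here is a small base-R sketch of a kernel-based cumulative distribution estimate for one gene across samples. This is not GSVA's actual code: the function names, the bandwidth (sd/4), the Poisson offset (0.5) and the toy values are all assumptions chosen for illustration.

```r
# Illustrative sketch only, not GSVA's implementation.
# The bandwidth h and the offset r are assumptions made for this example.

# Gaussian kernel: suited to continuous values (e.g. TPMs, log-CPMs)
gaussian_kcdf <- function(x, h = sd(x) / 4) {
  sapply(x, function(xj) mean(pnorm((xj - x) / h)))
}

# Poisson kernel: only meaningful for integer counts
poisson_kcdf <- function(x, r = 0.5) {
  sapply(x, function(xj) mean(ppois(xj, lambda = x + r)))
}

# Toy expression profiles of a single gene across five samples
tpm_gene   <- c(0.8, 3.2, 5.5, 12.1, 40.7)  # continuous values
count_gene <- c(0, 3, 5, 12, 41)            # integer counts

g <- gaussian_kcdf(tpm_gene)
p <- poisson_kcdf(count_gene)
print(round(g, 3))
print(round(p, 3))
```

Both functions return, for each sample, an estimate of how high the gene's expression is relative to its distribution across all samples. The kernel choice only changes how that distribution is smoothed, which is why it has to match the type of input data.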

cheers,

robert.