Hi there,
I'm currently stuck on the following problem:
I have a matrix of TPM values called 'expr' and a list of gene sets called 'list'. The TPM matrix is not log-transformed; it is taken directly from the Salmon output via tximport.
This is my code:
library(tximport)
library(GSVA)
library(pheatmap)

TPM <- tximport(files, type = "salmon", tx2gene = tx2gene, abundanceCol = "TPM")
TPMMatrix <- as.data.frame(TPM$abundance)
# some reformatting but no data manipulation
...
expr <- as.matrix(TPMMatrix)
#expr <- log10(expr)
res <- gsva(expr, list, rnaseq=FALSE)$es.obs
pheatmap(res, main = "GSVA")
GSVA now runs successfully, but I'm not sure I understand the rnaseq flag correctly. The documentation says:
Logical flag set by default to rnaseq=FALSE to inform whether the input gene expression data are continuous values, such as fluorescent units in logarithmic scale from microarray experiments or some other kind of continuous value derived from RNA-seq counts such as log-CPMs, log-RPKMs or log-TPMs. This flag should be set to rnaseq=TRUE only when the values of the input gene expression data are integer counts.
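To make the condition in that description concrete, here is a minimal base-R sketch of how one might check whether a matrix really consists of integer counts, and how plain TPMs could be put on a log scale. The helper name `is_integer_counts`, the toy numbers, and the pseudocount of 1 are all my own assumptions for illustration; none of this is part of GSVA or tximport.

```r
# Hypothetical helper (not part of GSVA): TRUE only if every value is a
# finite, non-negative whole number, i.e. the matrix looks like raw counts.
is_integer_counts <- function(m) {
  all(is.finite(m)) && all(m >= 0) && all(m == round(m))
}

# Toy matrices with made-up numbers: genes in rows, samples in columns.
counts <- matrix(c(0, 12, 7, 230, 5, 41), nrow = 3,
                 dimnames = list(paste0("g", 1:3), c("s1", "s2")))
tpm <- matrix(c(0.0, 12.4, 7.9, 230.1, 5.5, 41.2), nrow = 3,
              dimnames = dimnames(counts))

is_integer_counts(counts)  # TRUE
is_integer_counts(tpm)     # FALSE

# Assumed transformation: log2 with a pseudocount of 1 so that zeros stay finite.
log_tpm <- log2(tpm + 1)
```

By this check, a TPM matrix from tximport is continuous rather than integer-valued, so it would not fall under the "integer counts" case in the quoted text.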
In addition, I found the following explanation:
The name of the argument rnaseq can be misleading. When set to rnaseq=FALSE, the nonparametric estimation of the cumulative density function of the expression profile of each gene across samples is done using Gaussian kernels suited for continuous values. These were initially thought to be only microarray fluorescent units in logarithmic scale but nowadays these may also correspond to continuous values derived from RNA-seq experiments such as log-CPMs, log-RPKMs or log-TPMs. When rnaseq=TRUE, Poisson kernels are used instead and therefore, this option is only suitable when the input gene expression matrix is formed by integer counts. This implies that rnaseq=FALSE may also be used even when the expression data comes from a RNA-seq experiment. The name of this argument may change in the future to avoid this confusion.
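As far as I understand the two kernel choices described above, they could be sketched in base R roughly like this for a single gene. This is only my reading of the explanation, not GSVA's actual code: the function names, the bandwidth `sd(x) / 4`, the offset `r = 0.5`, and the toy values are assumptions made for illustration.

```r
# Kernel-smoothed CDF of one gene's expression profile, evaluated at each
# sample's own value. Both functions are illustrative, not GSVA internals.

# Gaussian kernel: one normal CDF centred on every sample, then averaged.
# Assumed bandwidth h = sd(x) / 4.
gaussian_kernel_cdf <- function(x, h = sd(x) / 4) {
  sapply(x, function(xj) mean(pnorm((xj - x) / h)))
}

# Poisson kernel: one Poisson CDF with mean (count + r) per sample, averaged.
# Assumed offset r = 0.5; only meaningful for integer counts.
poisson_kernel_cdf <- function(x, r = 0.5) {
  sapply(x, function(xj) mean(ppois(xj, lambda = x + r)))
}

# Toy data for a single gene across five samples (made-up numbers).
log_expr <- c(2.1, 3.4, 3.9, 5.0, 6.2)  # continuous, log-scale values
counts   <- c(3, 10, 14, 31, 73)        # integer read counts

g <- gaussian_kernel_cdf(log_expr)
p <- poisson_kernel_cdf(counts)
```

Both results are values between 0 and 1 that increase with expression. The Poisson version treats each input as a count, which is why it only makes sense for integer data.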
Am I right that rnaseq=TRUE should only be set if the TPM values are log-transformed? And that for plain TPM values, as I use them, it is not necessary? I'm confused now...
I tried both settings and the results, of course, differ substantially.
Thank you!