Question

Crash GSVA 1.24.0 with bootstrapping

0

Entering edit mode

Sander Tan ▴ 20

@sander-tan-12882

Last seen 6.7 years ago

The Hyve, Netherlands

Hi,

I recently updated GSVA to 1.24.0 and I'm still experiencing a crash when I run GSVA with bootstrapping. This is the code I ran, including with urls to RNA-Seq test data and gene set test data.

library(qusage)
library(GSVA)
library(ggplot2)
library(reshape2)

### Input files

### Skin Cutaneous Melanoma RNA Seq from TCGA (http://www.cbioportal.org/study?id=skcm_tcga)
### From https://github.com/cBioPortal/datahub/blob/master/public/skcm_tcga.tar.gz
skcm_expr_file <- "/Users/sander/Desktop/test_gsva/data_RNA_Seq_v2_expression_median.txt"

### Hallmark gene sets from MSigDB
### From http://software.broadinstitute.org/gsea/downloads.jsp
genesets_file <- "/Users/sander/Desktop/test_gsva/h.all.v5.2.entrez.gmt"

### Parameters
n_bootstraps <- 2

### Read gene sets
hallmark_genesets <- read.gmt(genesets_file)

### Read expression
skcm_inp <- read.table(skcm_expr_file, header = T, sep = "\t", quote = "", fill = T, check.names = F)
skcm_inp$Entrez_Gene_Id <- as.factor(skcm_inp$Entrez_Gene_Id)

### Remove NA entrez IDs
skcm_expr_not_na <- skcm_inp[complete.cases(skcm_inp),]

### Remove dup
skcm_expr_not_dub <- skcm_expr_not_na[!duplicated(skcm_expr_not_na$Entrez_Gene_Id),]

### Create only expr data table
skcm_expr <- as.matrix(skcm_expr_not_dub[,3:ncol(skcm_expr_not_dub)])
rownames(skcm_expr) <- skcm_expr_not_dub$Entrez_Gene_Id

### Plot distrubution of reads
skcm_expr_m <- melt(skcm_expr)
colnames(skcm_expr_m) <- c("Gene", "Sample", "Expression")
ggplot(skcm_expr_m, aes(x = Expression)) + geom_histogram()

### Log2 normalize
skcm_expr_norm <- log2(skcm_expr + 1)

### Plot distrubution of reads
skcm_expr_norm_m <- melt(skcm_expr_norm)
colnames(skcm_expr_norm_m) <- c("Gene", "Sample", "Log2_Expression")
ggplot(skcm_expr_norm_m, aes(x = Log2_Expression)) + geom_histogram()

### Run GSVA
gsva_result <- gsva(skcm_expr_norm, hallmark_genesets, method = "gsva")
gsva_scores <- data.frame("geneset_id" = rownames(gsva_result$es.obs), gsva_result$es.obs, check.names = F)
print(gsva_scores[1:5,1:5])

### Run GSVA with bootstrapping (THIS CRASHES)
gsva_result <- gsva(skcm_expr_norm, hallmark_genesets, method = "gsva", no.bootstraps = n_bootstraps)

SessionInfo():

R version 3.4.0 (2017-04-21)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.3

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] reshape2_1.4.2 ggplot2_2.2.1  GSVA_1.24.0    qusage_2.10.0  limma_3.32.1  

loaded via a namespace (and not attached):
[1] zoo_1.8-0            splines_3.4.0        lattice_0.20-35      colorspace_1.3-2     htmltools_0.3.6      stats4_3.4.0         yaml_2.1.14        
[8] survival_2.41-3      XML_3.98-1.6         DBI_0.6-1            BiocGenerics_0.22.0  multcomp_1.4-6       plyr_1.8.4           stringr_1.2.0      
[15] lsmeans_2.25-5       munsell_0.4.3        gtable_0.2.0         mvtnorm_1.0-6        codetools_0.2-15     coda_0.19-1          memoise_1.1.0      
[22] evaluate_0.10        Biobase_2.36.0       knitr_1.15.1         IRanges_2.10.0       parallel_3.4.0       AnnotationDbi_1.38.0 TH.data_1.0-8      
[29] GSEABase_1.38.0      Rcpp_0.12.10         xtable_1.8-2         scales_0.4.1         backports_1.0.5      S4Vectors_0.14.0     graph_1.54.0       
[36] annotate_1.54.0      digest_0.6.12        stringi_1.1.5        grid_3.4.0           rprojroot_1.2        tools_3.4.0          bitops_1.0-6       
[43] sandwich_2.3-4       magrittr_1.5         RCurl_1.95-4.8       lazyeval_0.2.0       RSQLite_1.1-2        tibble_1.3.0         MASS_7.3-47        
[50] Matrix_1.2-10        estimability_1.2     rmarkdown_1.5        nlme_3.1-131         compiler_3.4.0

GSVA bug • 1.7k views

ADD COMMENT • link updated 7.9 years ago by Robert Castelo ★ 3.4k • written 7.9 years ago by Sander Tan ▴ 20

score 3 · Accepted Answer · 2017-05-02

hi,

thanks for the report and the code reproducing the error. The background problem is that you are giving to the algorithm the whole data set. if you look closely to the output of the GSVA run without bootstrapping, you'll see a warning message saying that a number of genes had constant expression and these were excluded from the calculations. When using bootstrapping, this step of removing genes with constant expression is not done at each bootstrapped subset of the data and genes with constant expression causing the crash. obviously, the software should not crash like that, and i'll submit a fix that handles that circumstance more gracefully. however, as a general advice, you should remove genes that are lowly expressed. Here below you can see a couple of diagnostics about the mean expression per gene.

as you can see, there are two modes in the mean expression per gene (right plot) and genes belonging to the distribution of the first mode are unlikely to be expressed through the samples and therefore, unlikely to inform correctly the algorithm. many of these genes lead to the very low variability tail seen in the left plot where some of them have in fact zero variability, that is, constant expression values across the samples, and this will happen even more often with subsets of the data on genes with little variability.

by discarding lowly expressed genes, i've been able to run GSVA without errors:

mexp <- rowMeans(skcm_expr_norm)
maskexp <- mexp > 5
gsva_result <- gsva(skcm_expr_norm[maskexp, ], hallmark_genesets, method = "gsva", no.bootstraps = n_bootstraps)
gsva_result$bootstrap$es.bootstraps[1:5, 1:5, 1]
            [,1]       [,2]        [,3]        [,4]        [,5]
[1,] -0.40225739  0.3080503  0.30395551 -0.27946811 -0.13865527
[2,] -0.19911316  0.0309533  0.11158545 -0.04340616 -0.14620831
[3,]  0.29779462  0.1730939 -0.03801008  0.44148471 -0.11736234
[4,] -0.08859426 -0.1994489 -0.36246121  0.12199576  0.09049742
[5,] -0.39549558 -0.3374862 -0.03466244  0.25082259 -0.09520560

cheers,

robert.