Crash GSVA 1.24.0 with bootstrapping
1
0
Entering edit mode
Sander Tan ▴ 20
@sander-tan-12882
Last seen 3.2 years ago
The Hyve, Netherlands

Hi,

I recently updated GSVA to 1.24.0 and I'm still experiencing a crash when I run GSVA with bootstrapping. This is the code I ran, including with urls to RNA-Seq test data and gene set test data.

library(qusage)
library(GSVA)
library(ggplot2)
library(reshape2)

### Input files

### Skin Cutaneous Melanoma RNA Seq from TCGA (http://www.cbioportal.org/study?id=skcm_tcga)
### From https://github.com/cBioPortal/datahub/blob/master/public/skcm_tcga.tar.gz
skcm_expr_file <- "/Users/sander/Desktop/test_gsva/data_RNA_Seq_v2_expression_median.txt"

### Hallmark gene sets from MSigDB
### From http://software.broadinstitute.org/gsea/downloads.jsp
genesets_file <- "/Users/sander/Desktop/test_gsva/h.all.v5.2.entrez.gmt"

### Parameters
n_bootstraps <- 2

### Read gene sets
hallmark_genesets <- read.gmt(genesets_file)

### Read expression
skcm_inp <- read.table(skcm_expr_file, header = T, sep = "\t", quote = "", fill = T, check.names = F)
skcm_inp$Entrez_Gene_Id <- as.factor(skcm_inp$Entrez_Gene_Id)

### Remove NA entrez IDs
skcm_expr_not_na <- skcm_inp[complete.cases(skcm_inp),]

### Remove dup
skcm_expr_not_dub <- skcm_expr_not_na[!duplicated(skcm_expr_not_na$Entrez_Gene_Id),]

### Create only expr data table
skcm_expr <- as.matrix(skcm_expr_not_dub[,3:ncol(skcm_expr_not_dub)])
rownames(skcm_expr) <- skcm_expr_not_dub$Entrez_Gene_Id

### Plot distrubution of reads
skcm_expr_m <- melt(skcm_expr)
colnames(skcm_expr_m) <- c("Gene", "Sample", "Expression")
ggplot(skcm_expr_m, aes(x = Expression)) + geom_histogram()

### Log2 normalize
skcm_expr_norm <- log2(skcm_expr + 1)

### Plot distrubution of reads
skcm_expr_norm_m <- melt(skcm_expr_norm)
colnames(skcm_expr_norm_m) <- c("Gene", "Sample", "Log2_Expression")
ggplot(skcm_expr_norm_m, aes(x = Log2_Expression)) + geom_histogram()

### Run GSVA
gsva_result <- gsva(skcm_expr_norm, hallmark_genesets, method = "gsva")
gsva_scores <- data.frame("geneset_id" = rownames(gsva_result$es.obs), gsva_result$es.obs, check.names = F)
print(gsva_scores[1:5,1:5])

### Run GSVA with bootstrapping (THIS CRASHES)
gsva_result <- gsva(skcm_expr_norm, hallmark_genesets, method = "gsva", no.bootstraps = n_bootstraps)

SessionInfo():

R version 3.4.0 (2017-04-21)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.3

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] reshape2_1.4.2 ggplot2_2.2.1  GSVA_1.24.0    qusage_2.10.0  limma_3.32.1  

loaded via a namespace (and not attached):
[1] zoo_1.8-0            splines_3.4.0        lattice_0.20-35      colorspace_1.3-2     htmltools_0.3.6      stats4_3.4.0         yaml_2.1.14        
[8] survival_2.41-3      XML_3.98-1.6         DBI_0.6-1            BiocGenerics_0.22.0  multcomp_1.4-6       plyr_1.8.4           stringr_1.2.0      
[15] lsmeans_2.25-5       munsell_0.4.3        gtable_0.2.0         mvtnorm_1.0-6        codetools_0.2-15     coda_0.19-1          memoise_1.1.0      
[22] evaluate_0.10        Biobase_2.36.0       knitr_1.15.1         IRanges_2.10.0       parallel_3.4.0       AnnotationDbi_1.38.0 TH.data_1.0-8      
[29] GSEABase_1.38.0      Rcpp_0.12.10         xtable_1.8-2         scales_0.4.1         backports_1.0.5      S4Vectors_0.14.0     graph_1.54.0       
[36] annotate_1.54.0      digest_0.6.12        stringi_1.1.5        grid_3.4.0           rprojroot_1.2        tools_3.4.0          bitops_1.0-6       
[43] sandwich_2.3-4       magrittr_1.5         RCurl_1.95-4.8       lazyeval_0.2.0       RSQLite_1.1-2        tibble_1.3.0         MASS_7.3-47        
[50] Matrix_1.2-10        estimability_1.2     rmarkdown_1.5        nlme_3.1-131         compiler_3.4.0
GSVA bug • 772 views
ADD COMMENT
3
Entering edit mode
Robert Castelo ★ 2.7k
@rcastelo
Last seen 12 weeks ago
Barcelona/Universitat Pompeu Fabra

hi,

thanks for the report and the code reproducing the error. The background problem is that you are giving to the algorithm the whole data set. if you look closely to the output of the GSVA run without bootstrapping, you'll see a warning message saying that a number of genes had constant expression and these were excluded from the calculations. When using bootstrapping, this step of removing genes with constant expression is not done at each bootstrapped subset of the data and genes with constant expression causing the crash. obviously, the software should not crash like that, and i'll submit a fix that handles that circumstance more gracefully. however, as a general advice, you should remove genes that are lowly expressed. Here below you can see a couple of diagnostics about the mean expression per gene.

as you can see, there are two modes in the mean expression per gene (right plot) and genes belonging to the distribution of the first mode are unlikely to be expressed through the samples and therefore, unlikely to inform correctly the algorithm. many of these genes lead to the very low variability tail seen in the left plot where some of them have in fact zero variability, that is, constant expression values across the samples, and this will happen even more often with subsets of the data on genes with little variability.

by discarding lowly expressed genes, i've been able to run GSVA without errors:

mexp <- rowMeans(skcm_expr_norm)
maskexp <- mexp > 5
gsva_result <- gsva(skcm_expr_norm[maskexp, ], hallmark_genesets, method = "gsva", no.bootstraps = n_bootstraps)
gsva_result$bootstrap$es.bootstraps[1:5, 1:5, 1]
            [,1]       [,2]        [,3]        [,4]        [,5]
[1,] -0.40225739  0.3080503  0.30395551 -0.27946811 -0.13865527
[2,] -0.19911316  0.0309533  0.11158545 -0.04340616 -0.14620831
[3,]  0.29779462  0.1730939 -0.03801008  0.44148471 -0.11736234
[4,] -0.08859426 -0.1994489 -0.36246121  0.12199576  0.09049742
[5,] -0.39549558 -0.3374862 -0.03466244  0.25082259 -0.09520560

cheers,

robert.

 

 

ADD COMMENT
0
Entering edit mode

Thanks Robert for the excellent solution and advice. This also explains why I didn't experience the issue with a dataset that was based on a different kind of data.

ADD REPLY

Login before adding your answer.

Traffic: 250 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6