Question: Crash GSVA 1.24.0 with bootstrapping
0
gravatar for Sander Tan
2.4 years ago by
Sander Tan20
The Hyve, Netherlands
Sander Tan20 wrote:

Hi,

I recently updated GSVA to 1.24.0 and I'm still experiencing a crash when I run GSVA with bootstrapping. This is the code I ran, including with urls to RNA-Seq test data and gene set test data.

library(qusage)
library(GSVA)
library(ggplot2)
library(reshape2)

### Input files

### Skin Cutaneous Melanoma RNA Seq from TCGA (http://www.cbioportal.org/study?id=skcm_tcga)
### From https://github.com/cBioPortal/datahub/blob/master/public/skcm_tcga.tar.gz
skcm_expr_file <- "/Users/sander/Desktop/test_gsva/data_RNA_Seq_v2_expression_median.txt"

### Hallmark gene sets from MSigDB
### From http://software.broadinstitute.org/gsea/downloads.jsp
genesets_file <- "/Users/sander/Desktop/test_gsva/h.all.v5.2.entrez.gmt"

### Parameters
n_bootstraps <- 2

### Read gene sets
hallmark_genesets <- read.gmt(genesets_file)

### Read expression
skcm_inp <- read.table(skcm_expr_file, header = T, sep = "\t", quote = "", fill = T, check.names = F)
skcm_inp$Entrez_Gene_Id <- as.factor(skcm_inp$Entrez_Gene_Id)

### Remove NA entrez IDs
skcm_expr_not_na <- skcm_inp[complete.cases(skcm_inp),]

### Remove dup
skcm_expr_not_dub <- skcm_expr_not_na[!duplicated(skcm_expr_not_na$Entrez_Gene_Id),]

### Create only expr data table
skcm_expr <- as.matrix(skcm_expr_not_dub[,3:ncol(skcm_expr_not_dub)])
rownames(skcm_expr) <- skcm_expr_not_dub$Entrez_Gene_Id

### Plot distrubution of reads
skcm_expr_m <- melt(skcm_expr)
colnames(skcm_expr_m) <- c("Gene", "Sample", "Expression")
ggplot(skcm_expr_m, aes(x = Expression)) + geom_histogram()

### Log2 normalize
skcm_expr_norm <- log2(skcm_expr + 1)

### Plot distrubution of reads
skcm_expr_norm_m <- melt(skcm_expr_norm)
colnames(skcm_expr_norm_m) <- c("Gene", "Sample", "Log2_Expression")
ggplot(skcm_expr_norm_m, aes(x = Log2_Expression)) + geom_histogram()

### Run GSVA
gsva_result <- gsva(skcm_expr_norm, hallmark_genesets, method = "gsva")
gsva_scores <- data.frame("geneset_id" = rownames(gsva_result$es.obs), gsva_result$es.obs, check.names = F)
print(gsva_scores[1:5,1:5])

### Run GSVA with bootstrapping (THIS CRASHES)
gsva_result <- gsva(skcm_expr_norm, hallmark_genesets, method = "gsva", no.bootstraps = n_bootstraps)

SessionInfo():

R version 3.4.0 (2017-04-21)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.3

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] reshape2_1.4.2 ggplot2_2.2.1  GSVA_1.24.0    qusage_2.10.0  limma_3.32.1  

loaded via a namespace (and not attached):
[1] zoo_1.8-0            splines_3.4.0        lattice_0.20-35      colorspace_1.3-2     htmltools_0.3.6      stats4_3.4.0         yaml_2.1.14        
[8] survival_2.41-3      XML_3.98-1.6         DBI_0.6-1            BiocGenerics_0.22.0  multcomp_1.4-6       plyr_1.8.4           stringr_1.2.0      
[15] lsmeans_2.25-5       munsell_0.4.3        gtable_0.2.0         mvtnorm_1.0-6        codetools_0.2-15     coda_0.19-1          memoise_1.1.0      
[22] evaluate_0.10        Biobase_2.36.0       knitr_1.15.1         IRanges_2.10.0       parallel_3.4.0       AnnotationDbi_1.38.0 TH.data_1.0-8      
[29] GSEABase_1.38.0      Rcpp_0.12.10         xtable_1.8-2         scales_0.4.1         backports_1.0.5      S4Vectors_0.14.0     graph_1.54.0       
[36] annotate_1.54.0      digest_0.6.12        stringi_1.1.5        grid_3.4.0           rprojroot_1.2        tools_3.4.0          bitops_1.0-6       
[43] sandwich_2.3-4       magrittr_1.5         RCurl_1.95-4.8       lazyeval_0.2.0       RSQLite_1.1-2        tibble_1.3.0         MASS_7.3-47        
[50] Matrix_1.2-10        estimability_1.2     rmarkdown_1.5        nlme_3.1-131         compiler_3.4.0
gsva bug • 537 views
ADD COMMENTlink modified 2.4 years ago by Robert Castelo2.3k • written 2.4 years ago by Sander Tan20
Answer: Crash GSVA 1.24.0 with bootstrapping
3
gravatar for Robert Castelo
2.4 years ago by
Robert Castelo2.3k
Spain/Barcelona/Universitat Pompeu Fabra
Robert Castelo2.3k wrote:

hi,

thanks for the report and the code reproducing the error. The background problem is that you are giving to the algorithm the whole data set. if you look closely to the output of the GSVA run without bootstrapping, you'll see a warning message saying that a number of genes had constant expression and these were excluded from the calculations. When using bootstrapping, this step of removing genes with constant expression is not done at each bootstrapped subset of the data and genes with constant expression causing the crash. obviously, the software should not crash like that, and i'll submit a fix that handles that circumstance more gracefully. however, as a general advice, you should remove genes that are lowly expressed. Here below you can see a couple of diagnostics about the mean expression per gene.

as you can see, there are two modes in the mean expression per gene (right plot) and genes belonging to the distribution of the first mode are unlikely to be expressed through the samples and therefore, unlikely to inform correctly the algorithm. many of these genes lead to the very low variability tail seen in the left plot where some of them have in fact zero variability, that is, constant expression values across the samples, and this will happen even more often with subsets of the data on genes with little variability.

by discarding lowly expressed genes, i've been able to run GSVA without errors:

mexp <- rowMeans(skcm_expr_norm)
maskexp <- mexp > 5
gsva_result <- gsva(skcm_expr_norm[maskexp, ], hallmark_genesets, method = "gsva", no.bootstraps = n_bootstraps)
gsva_result$bootstrap$es.bootstraps[1:5, 1:5, 1]
            [,1]       [,2]        [,3]        [,4]        [,5]
[1,] -0.40225739  0.3080503  0.30395551 -0.27946811 -0.13865527
[2,] -0.19911316  0.0309533  0.11158545 -0.04340616 -0.14620831
[3,]  0.29779462  0.1730939 -0.03801008  0.44148471 -0.11736234
[4,] -0.08859426 -0.1994489 -0.36246121  0.12199576  0.09049742
[5,] -0.39549558 -0.3374862 -0.03466244  0.25082259 -0.09520560

cheers,

robert.

 

 

ADD COMMENTlink written 2.4 years ago by Robert Castelo2.3k

Thanks Robert for the excellent solution and advice. This also explains why I didn't experience the issue with a dataset that was based on a different kind of data.

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by Sander Tan20
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 110 users visited in the last hour