Question

Running time of piano's runGSA

rubi ▴ 110

Hi,

I'm running piano's runGSA on a list of 9881 genes (with directional fold-changes) and 13244 GO BP gene sets and it takes ~30 min to complete. I'm using the default geneSetStat option and all other arguments are at default values:

Final gene/gene-set association: 9881 genes and 13244 gene-sets
Details:
Calculating gene set statistics from 9881 out of 9881 gene-level statistics
Using all 9881 gene-level statistics for significance estimation
Removed 0 genes from GSC due to lack of matching gene statistics
Removed 0 gene sets containing no genes after gene removal
Removed additionally 0 gene sets not matching the size limits
Loaded additional information for 0 gene sets

Gene statistic type: F-like
Method: mean
Gene-set statistic name: mean
Significance: Gene sampling
Adjustment: fdr
Gene set size limit: (1,Inf)
Permutations: 10000
Total run time: 29.75 min

In contrast, if I upload this gene list to the GORILLA GO enrichment analysis website at http://cbl-gorilla.cs.technion.ac.il/ it takes a couple of seconds, and the order of magnitude of the p-values is not smaller.

Also, I'm not sure why all pDistinctDirUp and pDistinctDirDown values are NA.

> sessionInfo()

R version 3.3.1 (2016-06-21)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.12.1 (Sierra)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] grid parallel stats4 stats graphics grDevices utils datasets methods base

other attached packages:
[1] snpEnrichment_1.7.0 BiocInstaller_1.24.0 dplyr_0.5.0 piano_1.14.5 Gviz_1.18.1 GenomicRanges_1.26.1 GenomeInfoDb_1.10.2 IRanges_2.8.1
[9] S4Vectors_0.12.0 BiocGenerics_0.20.0

loaded via a namespace (and not attached):
[1] bitops_1.0-6 matrixStats_0.51.0 RColorBrewer_1.1-2 httr_1.2.1 data.tree_0.6.2 tools_3.3.1
[7] R6_2.2.0 rpart_4.1-10 KernSmooth_2.23-15 Hmisc_4.0-2 DBI_0.5-1 lazyeval_0.2.0
[13] colorspace_1.2-7 nnet_7.3-12 gridExtra_2.2.1 chron_2.3-47 Biobase_2.34.0 htmlTable_1.7
[19] influenceR_0.1.0 slam_0.1-40 rtracklayer_1.34.1 caTools_1.17.1 scales_0.4.1 relations_0.6-6
[25] stringr_1.1.0 digest_0.6.10 Rsamtools_1.26.1 foreign_0.8-67 XVector_0.14.0 base64enc_0.1-3
[31] dichromat_2.0-0 htmltools_0.3.5 ensembldb_1.6.2 limma_3.30.2 BSgenome_1.42.0 htmlwidgets_0.8
[37] rstudioapi_0.6 RSQLite_1.0.0 shiny_0.14.2 visNetwork_1.0.3 jsonlite_1.1 BiocParallel_1.8.1
[43] gtools_3.5.0 acepack_1.4.1 rgexf_0.15.3 VariantAnnotation_1.20.2 RCurl_1.95-4.8 magrittr_1.5
[49] Formula_1.2-1 Matrix_1.2-7.1 Rcpp_0.12.7 munsell_0.4.3 viridis_0.3.4 stringi_1.1.2
[55] yaml_2.1.14 SummarizedExperiment_1.4.0 zlibbioc_1.20.0 gplots_3.0.1 plyr_1.8.4 AnnotationHub_2.6.4
[61] gdata_2.17.0 snpStats_1.24.0 lattice_0.20-34 Biostrings_2.42.0 splines_3.3.1 GenomicFeatures_1.26.2
[67] knitr_1.15.1 fgsea_1.0.1 igraph_1.0.1 marray_1.52.0 biomaRt_2.30.0 fastmatch_1.0-4
[73] XML_3.98-1.5 biovizBase_1.22.0 latticeExtra_0.6-28 data.table_1.9.6 httpuv_1.3.3 gtable_0.2.0
[79] assertthat_0.1 ggplot2_2.2.1 mime_0.5 xtable_1.8-2 survival_2.40-1 tibble_1.2
[85] GenomicAlignments_1.10.0 AnnotationDbi_1.36.0 sets_1.0-16 cluster_2.0.5 Rook_1.1-1 DiagrammeR_0.9.0
[91] brew_1.0-6 interactiveDisplayBase_1.12.0

piano runGSA runtime • 1.3k views

updated 7.3 years ago by Leif Väremo ▴ 70 • written 7.3 years ago by rubi ▴ 110

Hi, could you clarify this part: "pMixedDirUp is anti-correlated with pMixedDirUp. I'm guessing the p-value is really 1-pMixedDirUp. This is not true for pMixedDirDown. Is this a bug?"
Is there a typo in one of the pMixedDirUp? I guess you mean something else?

7.3 years ago Leif Väremo ▴ 70

Could you also clarify what input you are using? The run output indicates that your gene-level statistics are in the range [0,Inf] (are they maybe ranks?) but you also mention directional fold-changes, so I am not sure...

7.3 years ago Leif Väremo ▴ 70

Sorry about the lack of clarity. I dropped the part about the anti-correlation between statMixedDirUp and pMixedDirUp. My question is only about the run time, which I guess is not solvable.

7.3 years ago rubi ▴ 110

score 2 · Accepted Answer · 2017-01-18

The runtime of piano for datasets with a large number of genes and gene-sets is unfortunately long because of the permutation steps (GORILLA uses a different approach without permutations). It is possible to speed it up by settling for fewer than the default 10,000 permutations (nPerm), or by using the ncpus argument to parallelize the computations. You could also try the fgsea method, which is very fast. It should yield results similar or identical to those of the fgsea package (piano imports functions from the fgsea package).

Just to clarify, piano expects gene-level statistics that correspond to e.g. fold-changes, so that, if sorted, up-regulated genes would appear at the top and down-regulated genes at the bottom.
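To see why the permutation count dominates the run time, here is a minimal base-R sketch of a gene-sampling test with a mean gene-set statistic. It is not piano's implementation: the toy sizes, the helper names `gene_sampling_p` and `run_all`, and the add-one p-value estimate are all assumptions made for illustration.

```r
# Toy gene-sampling permutation test (illustrative only, not piano's code).
set.seed(1)

n_genes <- 2000                      # hypothetical size, much smaller than the real run
stats <- abs(rnorm(n_genes))         # "F-like" gene-level statistics in [0, Inf)
names(stats) <- paste0("g", seq_len(n_genes))

# Hypothetical gene sets: 50 random sets of 20 genes each
gene_sets <- replicate(50, sample(names(stats), 20), simplify = FALSE)

gene_sampling_p <- function(stats, set, n_perm) {
  observed <- mean(stats[set])
  # Null distribution: mean statistic of n_perm random gene sets of the same size
  null <- replicate(n_perm, mean(sample(stats, length(set))))
  # Add-one estimate so the p-value is never exactly 0 (an assumption of this sketch)
  (sum(null >= observed) + 1) / (n_perm + 1)
}

run_all <- function(n_perm) {
  vapply(gene_sets, function(s) gene_sampling_p(stats, s, n_perm), numeric(1))
}

t_small <- system.time(p_small <- run_all(100))["elapsed"]
t_large <- system.time(p_large <- run_all(1000))["elapsed"]

# The work grows with (number of gene sets) x (number of permutations)
cat("100 permutations: ", t_small, "s\n")
cat("1000 permutations:", t_large, "s\n")

# Fewer permutations also cap the resolution of the p-values in this sketch
cat("smallest possible p with 100 permutations: ", 1 / 101, "\n")
cat("smallest possible p with 1000 permutations:", 1 / 1001, "\n")
```

In piano itself the corresponding settings are the nPerm and ncpus arguments of runGSA mentioned above. How much time they save depends on the data and the machine, and lowering nPerm presumably also limits how small the reported p-values can get, as it does in the sketch.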

If the gene-level statistics are "F-like" (as indicated in your case), i.e. ranging from 0 to Inf with a higher value meaning a "better" score, only the non-directional and mixed-directional p-values will be calculated. (Note that using ranks will not work, since the gene ranked number 1 will be interpreted as the least important.) "F-like" statistics do not carry any information about direction, whereas distinct-directional p-values require gene-level statistics that range from negative to positive values. This is why pDistinctDirUp and pDistinctDirDown are all NA. However, if fold-changes are supplied in the 'directions' argument, piano will subset the genes into up- and down-regulated ones and hence calculate the mixed-directional scores.
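As a small illustration of the point about direction, the base-R sketch below uses made-up fold-changes for six hypothetical genes (g1 to g6). It only shows how the sign is lost in an F-like statistic, how a separate directions vector lets the genes be split into up- and down-regulated subsets, and why ranks are misread. It does not call piano.

```r
# Hypothetical signed fold-changes for six genes
fold_change <- c(g1 = 2.1, g2 = -1.7, g3 = 0.4, g4 = -3.0, g5 = 1.2, g6 = -0.2)

# An "F-like" statistic discards the sign: larger simply means more extreme
f_like <- abs(fold_change)

# The sign is kept separately, in the spirit of a 'directions' vector
directions <- sign(fold_change)

# With directions available, genes can be split into up- and down-regulated subsets
up_stats <- f_like[directions > 0]
down_stats <- f_like[directions < 0]
print(names(up_stats))
print(names(down_stats))

# Ranks are a poor substitute: the most extreme gene gets rank 1,
# the smallest value, so it would be read as the least important
ranks <- rank(-f_like)
top_gene <- names(which.max(f_like))
print(ranks[top_gene])
```

Supplying the signed values themselves as gene-level statistics, rather than their absolute values, is what gives a statistic ranging from negative to positive.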

Hope this helps...