We run an involved process, repeatedly for different genes, using BiocParallel. With SerialParam or MulticoreParam(workers=2), the jobs, no matter how many, run to completion. With workers > 2, we see consistent failure with the rather cryptic error message
"database disk image is malformed"
Our primary database use is read-only operations on a local Postgres DNase footprint database.
An earlier thread, "A: foreach, doMC, and GOstats problems" (from March 2011), suggests that even read-only access to a database via DBI encounters corruption ("apparent corruption" might be more accurate, because the actual database is intact) with more than 2 workers.
Before we refactor our code, probably by extracting the database calls into a prior serial or 2-worker run, maybe others with similar experience have suggestions?
The bptry(bplapply(...)) log for the one process that succeeded:
############### LOG OUTPUT ###############
Task: 1
Node: 1
Timestamp: 2018-06-29 17:49:27
Success: TRUE

Task duration:
   user  system elapsed
 10.200   0.976  11.778

Memory used:
           used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells  8451553 451.4   14149528 755.7 10760405 574.7
Vcells 23116892 176.4   44160900 337.0 27363801 208.8

Log messages:
INFO [2018-06-29 17:49:15] runSGM on TREM2, ENSG00000095970
INFO [2018-06-29 17:49:15] assigning regions in mode tiny
INFO [2018-06-29 17:49:15] assigning regions in mode tiny, 1 regions
INFO [2018-06-29 17:49:15] constructing trenaSGM
INFO [2018-06-29 17:49:15] after ctor
INFO [2018-06-29 17:49:15] after build.spec assignement
INFO [2018-06-29 17:49:15] calling calculate
INFO [2018-06-29 17:49:26] calculate complete
INFO [2018-06-29 17:49:26] saving model for ENSG00000095970 (TREM2): 12 tfs
INFO [2018-06-29 17:49:26] save complete

stderr and stdout:
[1] -- runSGM(ENSG00000095970)
[1] trenaSGM::calculate building one of type footprint.database
[1] --- opening connection brain_hint_20
[1] --- querying brain_hint_20 for footprints across 1 regions totaling 2000 bases
[1] combined tbl.fp: 2061 17
[1] tf candidate count, in mtx, in tbl.regulatory.regions: 12/12
Log for a representative run of the five which failed:
############### LOG OUTPUT ###############
Task: 2
Node: 2
Timestamp: 2018-06-29 17:49:16
Success: FALSE

Task duration:
   user  system elapsed
  0.192   0.052   0.247

Memory used:
           used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells  8451391 451.4   14149528 755.7 10760405 574.7
Vcells 23114227 176.4   44160900 337.0 27363801 208.8

Log messages:
INFO [2018-06-29 17:49:15] runSGM on EPHA1, ENSG00000146904
INFO [2018-06-29 17:49:15] assigning regions in mode tiny
INFO [2018-06-29 17:49:15] assigning regions in mode tiny, 1 regions
INFO [2018-06-29 17:49:15] constructing trenaSGM
INFO [2018-06-29 17:49:15] after ctor
ERROR [2018-06-29 17:49:15] database disk image is malformed

stderr and stdout:
[1] -- runSGM(ENSG00000146904)
sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.4 LTS
Matrix products: default
BLAS: /local/users/pshannon/local/lib/R/lib/libRblas.so
LAPACK: /local/users/pshannon/local/lib/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] grid parallel stats4 stats graphics grDevices utils
[8] datasets methods base
other attached packages:
[1] BatchJobs_1.7 BBmisc_1.11 futile.logger_1.4.3
[4] RPostgreSQL_0.6-2 DBI_1.0.0 BiocParallel_1.14.1
[7] tibble_1.4.2 biomaRt_2.36.1 motifStack_1.24.0
[10] ade4_1.7-11 MotIV_1.36.0 grImport_0.9-1
[13] XML_3.98-1.11 trenaSGM_0.99.31 org.Hs.eg.db_3.6.0
[16] AnnotationDbi_1.42.1 Biobase_2.40.0 trena_1.3.4
[19] MotifDb_1.23.9 Biostrings_2.48.0 XVector_0.20.0
[22] glmnet_2.0-16 foreach_1.4.4 Matrix_1.2-14
[25] GenomicRanges_1.32.3 GenomeInfoDb_1.16.0 IRanges_2.14.10
[28] S4Vectors_0.18.3 BiocGenerics_0.26.0 RUnit_0.4.32
Thank you, Martin. The ipclock is an elegant solution, but it might not be needed after all. Here is an example where, with dbConnect, dbGetQuery and dbDisconnect all performed inside each worker call, any number of workers can query the database without conflict.
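A minimal sketch of that connect, query, disconnect pattern. The host, database name, credentials and SQL are hypothetical placeholders, not the values from my actual run, and the function names are made up for illustration.

```r
# Open a fresh connection, run one query, and always disconnect,
# so no connection object is ever shared between workers.
queryOnce <- function(sql, connect) {
    con <- connect()
    on.exit(DBI::dbDisconnect(con), add = TRUE)
    DBI::dbGetQuery(con, sql)
}

# Hypothetical connection factory: substitute your own host, database
# name and credentials.
connectFootprints <- function() {
    DBI::dbConnect(RPostgreSQL::PostgreSQL(),
                   host = "localhost", dbname = "brain_hint_20",
                   user = "someuser", password = "somepassword")
}

# Run every query in its own worker call; each call connects independently.
runQueries <- function(sqls, workers = 8) {
    BiocParallel::bplapply(sqls, queryOnce, connect = connectFootprints,
                           BPPARAM = BiocParallel::MulticoreParam(workers = workers))
}
```

Something like `runQueries(rep("select 1", 100), workers = 8)` would then exercise many simultaneous short-lived connections, assuming the placeholder connection details are replaced with real ones.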
Since this works at every scale I have tried so far (many more rows, many more calls), Postgres and DBI do not appear, in themselves, to be the source of the "database disk image is malformed" problem. Which makes sense: Postgres is a tried-and-true multiuser database.
Debugging suggestions welcome. All I can think of, by way of figuring this out, is to add progressively more code to the parallelized function until the "malformed image" error reappears.
Yep, I don't think there is any debugging magic here other than producing a minimal reproducible example. Happy to help when you arrive at something that I can work on...
Hi Martin,
Thanks for your offer. Here is a minimal example: 9 lines, a few of which are just for set up.
In brief: read-only access via select to org.Hs.eg.db, with largish query keys, frequently (but not always) produces the following error when more than two workers are requested:
database disk image is malformed
I can restructure my code to avoid calling select in the parallelized code, supplying a sufficiently comprehensive identifier map by other means. But perhaps this is not so hard to fix in the library?
The crucial code below is the lookup function. The get and mget calls are merely setup, giving us an easily reproduced list of 1551 gene symbols. The somewhat contrived task here is to convert the long lists of gene symbols to lists of ENSEMBL gene ids. To keep the set up simple, all of my three gene symbol lists are the same 1551 genes.
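A sketch along the lines described above. The way the 1551 symbols are obtained here (the first 1551 SYMBOL keys) is a stand-in I am assuming for illustration, not the get/mget setup itself, and `runLookup` is a made-up wrapper name.

```r
library(org.Hs.eg.db)   # attaching the package opens one SQLite connection
library(BiocParallel)

# Stand-in gene list: any 1551 gene symbols will do for this purpose.
symbols <- head(keys(org.Hs.eg.db, keytype = "SYMBOL"), 1551)
geneLists <- list(symbols, symbols, symbols)

# The crucial function: every worker calls select() on the org.Hs.eg.db
# object, and therefore on the connection inherited from the parent.
lookup <- function(genes) {
    select(org.Hs.eg.db, keys = genes, keytype = "SYMBOL", columns = "ENSEMBL")
}

# Convert all three lists in parallel with the requested number of workers.
runLookup <- function(workers) {
    bplapply(geneLists, lookup, BPPARAM = MulticoreParam(workers = workers))
}
```

Calling `runLookup(workers = 3)` is the case that fails intermittently for me; whether it fails on another machine or with other package versions is not something I can promise.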
Error: BiocParallel errors
element index: 1, 2, 3
first error: database disk image is malformed
The problem here is that library(org.Hs.eg.db) opens a database connection in the main process, and the workers then each use that same connection independently via select(). The solution is like your Postgres example: open (and close) the connection on the workers.
For what it's worth, the 'old school' join can be replaced by (a sluggish)
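A minimal sketch of the per-worker connection described above, assuming that AnnotationDbi's loadDb() opens a new connection to the package's SQLite file and that dbconn() returns it; both are worth checking against the AnnotationDbi documentation. The function names `lookupOnWorker` and `parallelLookup` are made up for illustration.

```r
# Runs on a worker: open a private connection to the SQLite file, query it,
# and close that connection when the function exits.
lookupOnWorker <- function(genes, dbfile) {
    db <- AnnotationDbi::loadDb(dbfile)
    on.exit(DBI::dbDisconnect(AnnotationDbi::dbconn(db)), add = TRUE)
    AnnotationDbi::select(db, keys = genes, keytype = "SYMBOL",
                          columns = "ENSEMBL")
}

# Runs in the parent: only the path to the SQLite file is handed to the
# workers, never an open connection.
parallelLookup <- function(geneLists, workers = 3) {
    dbfile <- org.Hs.eg.db::org.Hs.eg_dbfile()
    BiocParallel::bplapply(geneLists, lookupOnWorker, dbfile = dbfile,
                           BPPARAM = BiocParallel::MulticoreParam(workers = workers))
}
```

Passing the file path rather than the OrgDb object means each worker owns its connection from start to finish, which mirrors the Postgres example where connect, query and disconnect all happen inside one call.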
Thanks, Martin. Funny thing, all these years in and I did not know that the org db packages were backed by SQLite databases, with not only implicit connect and disconnect but explicit versions as well. Having seen that in your example, I will now code accordingly.