Question

BiocParallel: database disk image is malformed


Paul Shannon ▴ 470

@paul-shannon-5944


We run an involved process, repeatedly for different genes, using BiocParallel. With SerialParam or MulticoreParam(workers=2), the jobs, no matter how many, run to completion. With workers > 2, we see consistent failure with the rather cryptic error message

"database disk image is malformed"

Our primary database use is read-only operations on a local Postgres DNAse footprint database.

A: foreach, doMC, and GOstats problems (from March 2011)

suggests that even read-only access to a database via DBI encounters corruption ("apparent corruption" might be more accurate, because the actual database is intact) with more than 2 workers.

Before we refactor our code, probably by extracting the database calls into a prior serial or 2-worker run, maybe others with similar experience have suggestions?

The bptry(bplapply(...)) log for the one process that succeeded:

############### LOG OUTPUT ###############

Task: 1
Node: 1
Timestamp: 2018-06-29 17:49:27
Success: TRUE
Task duration:
   user  system elapsed
10.200   0.976  11.778
Memory used:
           used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells  8451553 451.4   14149528 755.7 10760405 574.7
Vcells 23116892 176.4   44160900 337.0 27363801 208.8
Log messages:INFO [2018-06-29 17:49:15] runSGM on TREM2, ENSG00000095970
INFO [2018-06-29 17:49:15]   assigning regions in mode tiny
INFO [2018-06-29 17:49:15]   assigning regions in mode tiny, 1 regions
INFO [2018-06-29 17:49:15] constructing trenaSGM
INFO [2018-06-29 17:49:15]    after ctor
INFO [2018-06-29 17:49:15]    after build.spec assignement
INFO [2018-06-29 17:49:15] calling calculate
INFO [2018-06-29 17:49:26] calculate complete
INFO [2018-06-29 17:49:26] saving model for ENSG00000095970 (TREM2): 12 tfs
INFO [2018-06-29 17:49:26] save complete

stderr and stdout:
[1] -- runSGM(ENSG00000095970)
[1]  trenaSGM::calculate building one of type footprint.database
[1] --- opening connection brain_hint_20
[1] --- querying brain_hint_20 for footprints across 1 regions totaling 2000 bases
[1]  combined tbl.fp: 2061 17
[1] tf candidate count, in mtx, in tbl.regulatory.regions: 12/12

Log for a representative run of the five which failed:

############### LOG OUTPUT ###############
Task: 2
Node: 2
Timestamp: 2018-06-29 17:49:16
Success: FALSE
Task duration:
   user  system elapsed
  0.192   0.052   0.247
Memory used:
           used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells  8451391 451.4   14149528 755.7 10760405 574.7
Vcells 23114227 176.4   44160900 337.0 27363801 208.8
Log messages:INFO [2018-06-29 17:49:15] runSGM on EPHA1, ENSG00000146904
INFO [2018-06-29 17:49:15]   assigning regions in mode tiny
INFO [2018-06-29 17:49:15]   assigning regions in mode tiny, 1 regions
INFO [2018-06-29 17:49:15] constructing trenaSGM
INFO [2018-06-29 17:49:15]    after ctor
ERROR [2018-06-29 17:49:15] database disk image is malformed

stderr and stdout:

[1] -- runSGM(ENSG00000146904)

sessionInfo()

R version 3.5.0 (2018-04-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.4 LTS

Matrix products: default
BLAS: /local/users/pshannon/local/lib/R/lib/libRblas.so
LAPACK: /local/users/pshannon/local/lib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
 [1] grid      parallel  stats4    stats     graphics  grDevices utils
 [8] datasets  methods   base

other attached packages:
 [1] BatchJobs_1.7        BBmisc_1.11          futile.logger_1.4.3
 [4] RPostgreSQL_0.6-2    DBI_1.0.0            BiocParallel_1.14.1
 [7] tibble_1.4.2         biomaRt_2.36.1       motifStack_1.24.0
[10] ade4_1.7-11          MotIV_1.36.0         grImport_0.9-1
[13] XML_3.98-1.11        trenaSGM_0.99.31     org.Hs.eg.db_3.6.0
[16] AnnotationDbi_1.42.1 Biobase_2.40.0       trena_1.3.4
[19] MotifDb_1.23.9       Biostrings_2.48.0    XVector_0.20.0
[22] glmnet_2.0-16        foreach_1.4.4        Matrix_1.2-14
[25] GenomicRanges_1.32.3 GenomeInfoDb_1.16.0  IRanges_2.14.10
[28] S4Vectors_0.18.3     BiocGenerics_0.26.0  RUnit_0.4.32

biocparallel databases • 3.6k views

ADD COMMENT • link updated 7.6 years ago by Martin Morgan 25k • written 7.6 years ago by Paul Shannon ▴ 470

score 2 · Answer 1 · 2018-06-29


Martin Morgan 25k

@martin-morgan-1513


It seems like there are two possibilities.

Suppose a connection to the db is created on the manager, and then the workers independently use that connection. 'A' accesses the database and the connection is modified to indicate the updated state of the database. 'B' then accesses the database with a connection that supposes the database is in the state before 'A' accessed it, when it is actually in the state after 'A' accessed it. A solution might be to open the database connection on the workers rather than the manager, close to when the database is actually used. I think this is the solution in the thread you mention, and I think it is the most likely explanation.

It might be that the database itself is not smart enough to handle multiple simultaneous reads. You could try to use a 'lock' to ensure that only one process is accessing the database at a time. Perhaps this would be used in conjunction with the first solution, so that the lock surrounds the establishment of the connection, the transaction, and the termination of the connection. BiocParallel provides an inter-process lock (i.e., for threads or processes on the same computer). The idea is to create a unique identifier

id <- ipcid()

and to use that to ensure only one process at a time is accessing the database, e.g.,

bplapply(1:5, function(i, id) {
    ## work unrelated to the DB, then...
    BiocParallel::ipclock(id) # only one process at a time
    ## open, transact, close DB
    BiocParallel::ipcunlock(id) # ok for next process
    ## more work unrelated to the DB
}, id) # pass the lock identifier through to the worker function

The example on the help page ?ipcid has more illustrations.
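A slightly fuller sketch of the locking idea, for illustration only: the helper name with_lock and the Sys.sleep() placeholder are assumptions, not part of BiocParallel or of the original code. The helper releases the lock through on.exit(), so a failure inside the locked section does not leave the other workers waiting.

```r
library(BiocParallel)

id <- ipcid()  # unique identifier shared by all workers

locked_task <- function(i, id) {
    ## Hypothetical helper: evaluate `code` while holding the lock and
    ## release the lock even if `code` signals an error.
    with_lock <- function(id, code) {
        BiocParallel::ipclock(id)
        on.exit(BiocParallel::ipcunlock(id))
        force(code)  # lazy evaluation: `code` runs here, inside the lock
    }

    ## work unrelated to the DB would go here

    value <- with_lock(id, {
        Sys.sleep(0.05)  # placeholder for open / transact / close
        i * 2
    })

    ## more work unrelated to the DB would go here
    value
}

res <- bplapply(1:4, locked_task, id = id,
                BPPARAM = MulticoreParam(workers = 2))
ipcremove(id)  # remove the external state backing the lock
unlist(res)
```

Defining the helper inside the worker function keeps the example independent of how the workers were started (forked or freshly launched).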

ADD COMMENT • link 7.6 years ago Martin Morgan 25k


Thank you, Martin. The ipclock is an elegant solution - but might not be needed after all. Here is an example where, with atomic dbConnect, dbGetQuery and dbDisconnect, any number of workers can query the database without conflict.

library(BiocParallel)
library(RPostgreSQL)
dbcall <- function(rows){
  database.host <- "bddsrds.globusgenomics.org"
  dbName <- "brain_hint_20"
  db.fp <- dbConnect(PostgreSQL(), 
                     user="trena", password="trena", 
                     dbname=dbName, host=database.host)
  query <- sprintf("select * from hits limit %d", rows)
  tbl <- dbGetQuery(db.fp, query)
  dbDisconnect(db.fp)
  return(tbl)
  }

row.counts <- sample(1:1000, 100)  # 86 workers on our linux box.
system.time(results <- bplapply(row.counts, dbcall,
                                BPPARAM=MulticoreParam()))
fivenum(unlist(lapply(results, nrow))) # 13.0 277.0 521.0 759.5 993.0

Since this works at every scale I have so far tried (many more rows, many more calls), postgres and DBI do not appear, in themselves, to be the source of the "database disk image is malformed" problem. Which makes sense: postgres is a tried-and-true multiuser database.

Debugging suggestions welcome. All I can think of, by way of figuring this out, is to add progressively more code to the parallelized function until the "malformed image" error reappears.
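One way to make that incremental search less painful is to collect per-task errors rather than stopping at the first one. The following is a minimal sketch, assuming a stand-in function (flaky, hypothetical) in place of the real per-gene code; bptry() and bpok() are BiocParallel functions, and stop.on.error=FALSE asks the back-end to attempt every task.

```r
library(BiocParallel)

## Hypothetical stand-in for the real per-gene function:
## it fails on even-numbered tasks.
flaky <- function(i) {
    if (i %% 2 == 0) stop("simulated failure for task ", i)
    sqrt(i)
}

param <- MulticoreParam(workers = 2, stop.on.error = FALSE)

## bptry() returns the partial results instead of signalling an error;
## failed tasks appear in the list as condition objects.
res <- bptry(bplapply(1:6, flaky, BPPARAM = param))

ok <- bpok(res)      # TRUE for tasks that succeeded
failed <- which(!ok) # indices of the tasks that failed
failed

## Error message reported by each failed task
vapply(res[!ok], conditionMessage, character(1))
```

After each new piece of code is added to the parallelized function, the failed indices and messages show whether the error has come back and for which inputs.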

ADD REPLY • link 7.6 years ago Paul Shannon ▴ 470


Yep, I don't think there is any debugging magic here other than to produce a minimal reproducible example. Happy to help when you arrive at something that I can work on...

ADD REPLY • link 7.6 years ago Martin Morgan 25k


Hi Martin,

Thanks for your offer. Here is a minimal example: 9 lines, a few of which are just for set up.

In brief: read-only access via select to org.Hs.eg.db, with largish query keys, frequently (but not always) produces the following error when more than two workers are requested:

database disk image is malformed

I can restructure my code to avoid calling select in the parallelized code, supplying a sufficiently comprehensive identifier map by other means. But perhaps this is not so hard to fix in the library?

The crucial code below is the lookup function. The get and mget calls are merely setup, giving us an easily reproduced list of 1551 gene symbols. The somewhat contrived task here is to convert the long lists of gene symbols to lists of ENSEMBL gene ids. To keep the set up simple, all of my three gene symbol lists are the same 1551 genes.

library(BiocParallel)
library(org.Hs.eg.db)

lookup <- function(geneSymbols){
   tbl.map <- select(org.Hs.eg.db, keys=geneSymbols, keytype="SYMBOL", columns="ENSEMBL")
   }

tf.entrezIDs <- unique(unlist(get("GO:0003700",
                       envir=org.Hs.egGO2ALLEGS)), use.names=FALSE)
tf.geneSymbols <- unique(unlist(mget(tf.entrezIDs, 
                         envir=org.Hs.egSYMBOL), use.names=FALSE))

geneSymbols.list <- list(g1=tf.geneSymbols, g2=tf.geneSymbols, g3=tf.geneSymbols)
x <- bplapply(geneSymbols.list, lookup)

Error: BiocParallel errors
element index: 1, 2, 3
first error: database disk image is malformed

sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.4 LTS

Matrix products: default
BLAS: /local/users/pshannon/local/lib/R/lib/libRblas.so
LAPACK: /local/users/pshannon/local/lib/R/lib/libRlapack.so

locale:
  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
  [9] LC_ADDRESS=C               LC_TELEPHONE=C
 [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

 attached base packages:
 [1] parallel  stats4    stats     graphics  grDevices utils     datasets
 [8] methods   base

 other attached packages:
 [1] org.Hs.eg.db_3.6.0   AnnotationDbi_1.42.1 IRanges_2.14.10
 [4] S4Vectors_0.18.3     Biobase_2.40.0       BiocGenerics_0.26.0
 [7] BiocParallel_1.14.1

 loaded via a namespace (and not attached):
  [1] Rcpp_0.12.17    digest_0.6.15   DBI_1.0.0       RSQLite_2.1.1
  [5] blob_1.1.1      tools_3.5.0     bit64_0.9-7     bit_1.1-14
  [9] compiler_3.5.0  pkgconfig_2.0.1 memoise_1.1.0

ADD REPLY • link 7.6 years ago Paul Shannon ▴ 470


The problem here is that library(org.Hs.eg.db) opens a database connection in the main process, and the worker processes then use that same connection independently via select(). The solution, as in your Postgres example, is to open (and close) the connection on the workers. Something like

lookup <- function(geneSymbols, dbfile) {
    ## open a private connection on this worker; close it on exit
    db <- AnnotationDbi::loadDb(dbfile)
    on.exit(DBI::dbDisconnect(AnnotationDbi::dbconn(db)))
    AnnotationDbi::select(
        db, keys=geneSymbols, keytype="SYMBOL", columns="ENSEMBL"
    )
}

x <- bplapply(geneSymbols.list, lookup, dbfile(org.Hs.eg.db))

For what it's worth the 'old school' join can be replaced by (a sluggish)

tf.geneSymbols <-
    unique( select(org.Hs.eg.db, "GO:0003700", "SYMBOL", "GOALL")$SYMBOL )
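The same per-worker connection pattern can be tried without any annotation package. This is a sketch under stated assumptions: the throwaway SQLite file, the table name genes, and the function count_rows are all made up for illustration. Each task opens its own read-only connection, queries, and disconnects, so no connection object is shared between processes.

```r
library(BiocParallel)
library(DBI)

## Build a small throwaway SQLite file to stand in for a real database.
path <- tempfile(fileext = ".sqlite")
con <- dbConnect(RSQLite::SQLite(), path)
dbWriteTable(con, "genes",
             data.frame(id = 1:100, symbol = paste0("G", 1:100)))
dbDisconnect(con)

## Each task opens a private read-only connection and closes it on exit.
count_rows <- function(n, path) {
    con <- DBI::dbConnect(RSQLite::SQLite(), path,
                          flags = RSQLite::SQLITE_RO)
    on.exit(DBI::dbDisconnect(con))
    sql <- sprintf("SELECT * FROM genes LIMIT %d", n)
    nrow(DBI::dbGetQuery(con, sql))
}

limits <- c(5L, 10L, 20L, 40L)
res <- bplapply(limits, count_rows, path = path,
                BPPARAM = MulticoreParam(workers = 3))
unlist(res)
```

Only the file path is passed to the workers; the connection itself never crosses a process boundary.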

ADD REPLY • link 7.6 years ago Martin Morgan 25k


Thanks, Martin. Funny thing: all these years in, I did not know that the org db packages were SQLite databases, with not only implicit connect and disconnect but explicit versions also. Having seen that in your example, I will now code appropriately.

ADD REPLY • link 7.6 years ago Paul Shannon ▴ 470