Is it possible to parallelise saveHDF5SummarizedExperiment
@stefanomangiola-6873
Australia

I build large SingleCellExperiment and SummarizedExperiment objects by cbind-ing up to thousands of HDF5-backed objects. Would it be possible to speed up saveHDF5SummarizedExperiment() many-fold, perhaps by parallelising it?

e.g. for this operation

Start writing assay 1/2 to HDF5 file:
  /vast/projects/cellxgene_curated/cellNexus/pseudobulk_joined/assays.h5
/ reading and realizing sparse block 1/744 ... ok
\ Writing it ... OK
/ reading and realizing sparse block 2/744 ... 

Thanks.

HDF5Array • 399 views
@herve-pages-1542
Seattle, WA, United States

Hi Stefano,

Unfortunately, HDF5 doesn't support concurrent writing, so parallelising the write is not an option.

A few things that maybe could help:

  • Make sure that the binding was obtained with a single cbind(x_1, x_2, ..., x_N) call (i.e. N-ary cbind) instead of many calls to cbind(x, y) (i.e. binary cbind) in a loop. This will reduce the complexity of the stack of delayed ops (you can display this stack with showtree(assay(se))).

  • saveHDF5SummarizedExperiment() writes the data to disk one block at a time. Try to increase the block size with setAutoBlockSize(). Since your data is sparse, you should be able to use blocks that are much bigger than the default size (which is 100Mb). Maybe 10x or 50x bigger, depending on how sparse your data is.

  • Try different chunk geometries and compression levels. These are controlled via the chunkdim and level arguments of saveHDF5SummarizedExperiment().
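
The three suggestions above could be combined roughly as in the following sketch. It is only an illustration under stated assumptions: the toy objects in se_list, the 10x block size, and the chunkdim and level values are hypothetical placeholders, not recommendations from this thread, and would need tuning on the real data.

```r
library(SummarizedExperiment)
library(DelayedArray)
library(HDF5Array)

# Hypothetical toy data: 5 small objects standing in for the real
# HDF5-backed SummarizedExperiments (same genes, different cells).
set.seed(1)
se_list <- lapply(1:5, function(i) {
  counts <- matrix(rpois(200 * 50, lambda = 0.1), nrow = 200,
                   dimnames = list(paste0("gene", 1:200),
                                   paste0("s", i, "_cell", 1:50)))
  SummarizedExperiment(assays = list(counts = DelayedArray(counts)))
})

# 1. A single N-ary cbind instead of a loop of binary cbind(x, y) calls.
se <- do.call(cbind, se_list)
showtree(assay(se))  # display the stack of delayed operations

# 2. Larger blocks: 1e9 bytes is 10x the default (an assumed value;
#    how far you can go depends on sparsity and available memory).
setAutoBlockSize(1e9)

# 3. Explicit chunk geometry and compression level (assumed values).
out_dir <- tempfile("h5_se_")
timing <- system.time(
  saveHDF5SummarizedExperiment(se, dir = out_dir, replace = TRUE,
                               chunkdim = c(100, 50), level = 3,
                               verbose = TRUE)
)
print(timing)

# Reload to check that the saved object is usable.
se2 <- loadHDF5SummarizedExperiment(out_dir)
```

Wrapping the save in system.time() makes it easy to compare a few block sizes, chunk geometries and compression levels on a subset of the data before committing to a full run.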

Note that TileDB supports concurrent writing, so maybe we need a saveTileDBSummarizedExperiment() function?

Best,

H.


Thanks Hervé, I will try.

For reference, this is a parallelised implementation from zellkonverter for saving AnnData from an SCE:

https://github.com/theislab/zellkonverter/issues/129#issuecomment-2473607227


A saveTileDBSummarizedExperiment() function would be awesome. I think we need better performance in both the execution time and the memory required to save.
