Speeding up DelayedArray::write_block()
Koki ▴ 10
@koki-7888
Japan

I implemented some functions based on block processing, referring to the write_block() documentation.

https://rdrr.io/bioc/DelayedArray/man/write_block.html

I can now control the memory usage of my function.

However, when the data is large, write_block() itself accounts for much of the computation time and becomes slow.

After switching to uncompressed writes with HDF5Array::setHDF5DumpCompressionLevel(0L), the calculation time improved considerably, but I would still like to make it a little faster.

Do you have any tips for speeding up write_block()?

Here is what I am considering so far.

# 1. Sparse Array:

I thought I could speed up the process by using sparse multi-dimensional arrays that only store non-zero values.

However, although sparse matrices are defined in the Matrix package, I don't know of any good implementation of sparse multi-dimensional arrays in R.

Also, I found that sparse2dense() is called in write_block().

https://github.com/Bioconductor/DelayedArray/blob/1026afd3bb55fbf54509b3675ac36af6767ef70a/R/write_block.R#L34

Does this mean that even if I use SparseArraySeed or as.sparse=TRUE, it will be forced to convert to a dense array when writing to the HDF5 file?

# 2. Chunk Size:

To begin with, does the "chunk_dim" option in writeHDF5Array mean the same thing as "chunk size" in HDF5 files?

I found that the chunk_dim option is automatically set by getHDF5DumpChunkDim() and seems to be specified by the dimension of the block extracted as the viewport.

Am I understanding this correctly?

https://bioc.ism.ac.jp/packages/3.7/bioc/manuals/HDF5Array/man/HDF5Array.pdf

https://github.com/Bioconductor/HDF5Array/blob/266ec6860490b6e7ff8b83d8906d44085325868a/R/writeHDF5Array.R#L104

Also, is there any possibility that adjusting the value of this chunk_dim will improve the speed of write_block()?

# 3. Parallel Computing:

I found that the functions blockApply() and blockReduce() are implemented in DelayedArray and are based on BiocParallel.

Can write_block() be made faster by using these functions?

Would both the computation on each block and the writing of the results to the HDF5 file be performed by multiple processes?

I think parallel I/O in HDF5 is still a difficult problem to solve though.

Besides, does it mean that parallel computing with 2 processes will result in less than 2 times the computation time, but 2 times the memory usage?

Best,

Koki

HDF5Array DelayedArray

Can someone please respond to this when you have time?

@herve-pages-1542
Seattle, WA, United States

Hi Koki,

Sorry for the slow response.

1. Sparse Array:

> Does this mean that even if I use SparseArraySeed or as.sparse=TRUE, it will be forced to convert to a dense array when writing to the HDF5 file?

Yes. Writing to the HDF5 file is done with rhdf5::h5write(), which only knows about dense arrays. The thing is that there's no native support for sparse data in HDF5, and the HDF5 C library only provides facilities to handle dense data. In particular, the C function H5Dwrite() provided by this library (and used by rhdf5::h5write() to write the data) only accepts dense input data. This doesn't mean that there isn't room for improvement, e.g. we could implement a version of rhdf5::h5write() that writes the input data chunk by chunk and only expands the sparse data at the chunk level, right before writing it to disk. This is not a trivial effort though.
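To make the chunk-level idea a bit more concrete, here is a minimal base R sketch. It is only an illustration under stated assumptions, not how rhdf5 or HDF5Array work today: the sparse data is assumed to be in a simple COO layout (a matrix of indices plus a vector of values), `expand_chunk()` is a hypothetical helper, and the "write" step is simulated by copying each dense chunk into an ordinary matrix.

```r
# Sparse 2D data in COO form: one row of 'nzindex' per nonzero value.
dims     <- c(6L, 4L)
nzindex  <- rbind(c(1L, 1L), c(2L, 3L), c(5L, 2L), c(6L, 4L))
nzdata   <- c(10, 20, 30, 40)
chunkdim <- c(3L, 2L)   # toy chunk shape that divides 'dims' evenly

# Hypothetical helper: densify only the nonzero values that fall inside
# the chunk whose upper-left corner is 'start' (1-based).
expand_chunk <- function(start, chunkdim, nzindex, nzdata) {
    end  <- start + chunkdim - 1L
    keep <- nzindex[, 1] >= start[1] & nzindex[, 1] <= end[1] &
            nzindex[, 2] >= start[2] & nzindex[, 2] <= end[2]
    chunk <- matrix(0, nrow = chunkdim[1], ncol = chunkdim[2])
    # Shift the global indices so they are relative to the chunk.
    local_idx <- sweep(nzindex[keep, , drop = FALSE], 2, start - 1L)
    chunk[local_idx] <- nzdata[keep]
    chunk
}

# Walk over the chunks. A real writer would hand each dense chunk to the
# HDF5 layer here; this sketch just copies it into a full-size matrix.
out <- matrix(0, nrow = dims[1], ncol = dims[2])
for (i in seq(1L, dims[1], by = chunkdim[1])) {
    for (j in seq(1L, dims[2], by = chunkdim[2])) {
        chunk <- expand_chunk(c(i, j), chunkdim, nzindex, nzdata)
        out[i:(i + chunkdim[1] - 1L), j:(j + chunkdim[2] - 1L)] <- chunk
    }
}
```

The point is that at most one dense chunk exists in memory at a time, instead of a dense copy of the whole block.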

Related to multidimensional sparse arrays in R: SparseArraySeed was a quick/easy solution but is not a very good one. I've actually been busy with the development of a new S4 class to replace it: SVT_SparseArray. It's in the S4Arrays package (https://github.com/Bioconductor/S4Arrays) which is not part of Bioconductor yet. It uses a new layout to organize the data internally. This is still work-in-progress but I'm happy with how it performs so far. Once S4Arrays makes it to Bioconductor, the plan is to have the DelayedArray stack depend on it and to replace the use of SparseArraySeed objects with SVT_SparseArray objects everywhere. This will in general improve performance of block processing when the blocks are sparse.

2. Chunk Size:

> To begin with, does the "chunk_dim" option in writeHDF5Array mean the same thing as "chunk size" in HDF5 files?

Yes, but with the dimensions in reverse order (because everything is transposed between R and HDF5).

> Also, is there any possibility that adjusting the value of this chunk_dim will improve the speed of write_block()?

You can specify the chunk_dim when you call HDF5RealizationSink().
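For example, something along these lines could be used to time writes with an explicit chunk shape. This is a sketch under assumptions: the dimensions, chunk shape, and block shape are toy values picked for illustration, and the argument is spelled `chunkdim` in the HDF5Array releases I am aware of (older releases used `chunk_dim`), so check ?HDF5RealizationSink for your version.

```r
library(DelayedArray)
library(HDF5Array)

# Toy dimensions and chunk shape, chosen only for illustration.
dims     <- c(2000L, 500L)
chunkdim <- c(500L, 100L)

# Sink with an explicit chunk shape and no compression (level 0).
sink <- HDF5RealizationSink(dims, type = "double",
                            chunkdim = chunkdim, level = 0L)

# Grid whose blocks are whole multiples of the chunk shape, so that each
# write_block() call covers complete chunks only.
grid <- RegularArrayGrid(dims, spacings = c(1000L, 500L))

elapsed <- system.time({
    for (bid in seq_along(grid)) {
        viewport <- grid[[bid]]
        # Random data stands in for the result of a real computation.
        block <- array(runif(prod(dim(viewport))), dim = dim(viewport))
        sink <- write_block(sink, viewport, block)
    }
})
close(sink)
res <- as(sink, "DelayedArray")
print(elapsed)
```

Rerunning this with a few different `chunkdim` values on data shaped like yours should show whether the chunk shape matters for your use case.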

3. Parallel Computing:

As you're aware, parallel I/O in HDF5 is tricky. Maybe it's possible to support parallel calls to write_block(), but I've not looked into it. The thing is, I don't anticipate much benefit from this, for the reason you mention:

> parallel computing with 2 processes will result in less than 2 times the computation time, but 2 times the memory usage?

Probably. And I don't even think it's going to be 2 times faster because the concurrent calls to write_block() will now compete for I/O.

Best,

H.


I'll check whether chunk_dim in HDF5RealizationSink() contributes to the speed. In my experience, using a sparse format is dramatically more effective for matrix/array operations and I/O. Other HDF5-based file formats, such as 10X Genomics' HDF5 format and Loom, also seem to support a sparse format. I'm looking forward to S4Arrays::SVT_SparseArray.

Thanks,

Koki