Speeding up DelayedArray::write_block()
Koki
@koki-7888

I implemented some functions based on block processing, referring to the write_block() documentation.

https://rdrr.io/bioc/DelayedArray/man/write_block.html

I can now control the memory usage of my function.

However, when the data is large, the write_block() step itself accounts for a large share of the computation time and becomes the bottleneck.

Using HDF5Array::setHDF5DumpCompressionLevel(0L) to write the data uncompressed improved the running time considerably, but I would still like to make it a little faster.
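
For reference, this is the pattern I'm using (a minimal sketch; the dimensions and the per-block computation below are made up):

library(DelayedArray)
library(HDF5Array)

setHDF5DumpCompressionLevel(0L)  # write uncompressed chunks

x <- DelayedArray(array(runif(2e6), dim = c(2000, 1000)))
sink <- HDF5RealizationSink(dim(x), type = "double")
grid <- defaultAutoGrid(x)       # block geometry chosen automatically
for (bid in seq_along(grid)) {
    viewport <- grid[[bid]]
    block <- read_block(x, viewport)
    block <- log1p(block)        # some per-block computation
    sink <- write_block(sink, viewport, block)
}
close(sink)
result <- as(sink, "DelayedArray")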

Do you know of any tips for speeding up write_block()?

I'm currently considering the following approaches.

1. Sparse Array:

I thought I could speed up the process by using sparse multi-dimensional arrays that only store the non-zero values.

However, although sparse matrices are available in the Matrix package, I don't know of a good implementation for handling sparse multi-dimensional arrays in R.

Also, I found that sparse2dense() is called inside write_block().

https://github.com/Bioconductor/DelayedArray/blob/1026afd3bb55fbf54509b3675ac36af6767ef70a/R/write_block.R#L34

Does this mean that even if I use SparseArraySeed or as.sparse=TRUE, it will be forced to convert to a dense array when writing to the HDF5 file?

2. Chunk Size:

To begin with, does the "chunk_dim" option in writeHDF5Array mean the same thing as "chunk size" in HDF5 files?

https://support.hdfgroup.org/HDF5/doc/Advanced/Chunking/

I found that the chunk_dim option is automatically set by getHDF5DumpChunkDim(), and that it seems to be determined by the dimensions of the block extracted as the viewport.

Am I understanding this correctly?

https://bioc.ism.ac.jp/packages/3.7/bioc/manuals/HDF5Array/man/HDF5Array.pdf

https://github.com/Bioconductor/HDF5Array/blob/266ec6860490b6e7ff8b83d8906d44085325868a/R/writeHDF5Array.R#L104

Also, is there any possibility that adjusting the value of this chunk_dim will improve the speed of write_block()?
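
For reference, this is how I inspected the automatic choice (a small sketch; the array dimensions below are made up):

library(HDF5Array)

getHDF5DumpChunkDim(c(20000L, 10000L))  # chunk geometry the dump settings would use for these dims
getHDF5DumpChunkLength()                # global cap on the number of elements per chunk
getHDF5DumpChunkShape()                 # chunk shape policy, e.g. "scale"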

3. Parallel Computing:

I found that the functions blockApply() and blockReduce(), which are BiocParallel-based, are implemented in DelayedArray (see the sketch at the end of this item).

Using these functions, can write_block() be faster?

Does this mean that both the per-block computations and the writing of the results to the HDF5 file will be performed across multiple processes?

I think parallel I/O in HDF5 is still a difficult problem to solve though.

Also, does it mean that parallel computing with 2 processes will give less than a 2-fold speed-up, but a 2-fold increase in memory usage?
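
Here is the blockApply() pattern I have in mind (a minimal sketch; the input and worker count are arbitrary, and MulticoreParam would need to be swapped for SnowParam on Windows):

library(DelayedArray)
library(BiocParallel)

x <- DelayedArray(array(runif(1e6), dim = c(1000, 1000)))
# Per-block computation runs on 2 workers; each worker reads its own blocks.
block_sums <- blockApply(x, sum, BPPARAM = MulticoreParam(workers = 2L))
total <- Reduce(`+`, block_sums)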

Best,

Koki

HDF5Array DelayedArray

Can someone please respond to this when you have time?

@herve-pages-1542

Hi Koki,

Sorry for the slow response.

1. Sparse Array:

Does this mean that even if I use SparseArraySeed or as.sparse=TRUE, it will be forced to convert to a dense array when writing to the HDF5 file?

Yes. Writing to the HDF5 file is done with rhdf5::h5write(), which only knows about dense arrays. The thing is that there's no native support for sparse data in HDF5, and the HDF5 C library only provides facilities for handling dense data. In particular, the C function H5Dwrite() provided by this library (and used by rhdf5::h5write() to write the data) only accepts dense input data. This doesn't mean there isn't room for improvement: e.g. we could implement a version of rhdf5::h5write() that writes the input data chunk by chunk and only expands the sparse data at the chunk level, right before writing it to disk. This is not a trivial effort though.
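
A rough sketch of that idea (the helper below is hypothetical, not an existing API; it assumes the target HDF5 dataset already exists, e.g. created with rhdf5::h5createDataset()):

library(rhdf5)
library(DelayedArray)

# Walk a chunk-shaped grid over the sparse block and densify one
# chunk at a time, right before handing it to rhdf5::h5write().
write_sparse_by_chunk <- function(sparse_block, filepath, name, chunkdim) {
    grid <- RegularArrayGrid(dim(sparse_block), spacings = chunkdim)
    for (bid in seq_along(grid)) {
        vp <- grid[[bid]]
        chunk <- as.array(read_block(sparse_block, vp))  # expand this chunk only
        h5write(chunk, filepath, name,
                start = start(vp), count = width(vp))
    }
}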

Related to multidimensional sparse arrays in R: SparseArraySeed was a quick/easy solution but is not a very good one. I've actually been busy with the development of a new S4 class to replace it: SVT_SparseArray. It's in the S4Arrays package (https://github.com/Bioconductor/S4Arrays) which is not part of Bioconductor yet. It uses a new layout to organize the data internally. This is still work-in-progress but I'm happy with how it performs so far. Once S4Arrays makes it to Bioconductor, the plan is to have the DelayedArray stack depend on it and to replace the use of SparseArraySeed objects with SVT_SparseArray objects everywhere. This will in general improve performance of block processing when the blocks are sparse.

2. Chunk Size:

To begin with, does the "chunk_dim" option in writeHDF5Array mean the same thing as "chunk size" in HDF5 files?

Yes, but it's the reverse of it (the order of the dimensions is flipped because everything is transposed between R and HDF5).

Also, is there any possibility that adjusting the value of this chunk_dim will improve the speed of write_block()?

You can specify the chunk_dim when you call HDF5RealizationSink().
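
For example (a sketch; the dimensions and chunk geometry below are arbitrary). Note that recent HDF5Array releases spell the argument chunkdim, while older releases such as the Bioc 3.7 manual linked above used chunk_dim:

library(HDF5Array)

sink <- HDF5RealizationSink(c(20000L, 5000L), type = "double",
                            chunkdim = c(2000L, 500L))
# ... write blocks with write_block(sink, viewport, block) ...
close(sink)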

3. Parallel Computing:

As you're aware, parallel I/O in HDF5 is tricky. Maybe it's possible to support parallel calls to write_block(); I've not looked into it. The thing is, I don't anticipate much benefit from it, because of this:

parallel computing with 2 processes will give less than a 2-fold speed-up, but a 2-fold increase in memory usage?

Probably. And I don't even think it's going to be 2 times faster because the concurrent calls to write_block() will now compete for I/O.

Best,

H.


Thanks again for your kind comments, Hervé.

I'll check whether chunk_dim in HDF5RealizationSink() improves the speed. In my experience, using a sparse format is dramatically more effective for matrix/array operations and I/O. Other HDF5-based file formats, such as the 10X Genomics HDF5 format and Loom, also seem to support sparse storage. I'm looking forward to S4Arrays::SVT_SparseArray.

Thanks,

Koki


I found that the SparseArray package containing the SVT_SparseArray class you mentioned before has been released in Bioconductor 3.17. https://bioconductor.org/packages/release/bioc/html/SparseArray.html

Congratulations! As you said before, will SparseArraySeed in DelayedArray be replaced by this new sparse array format?
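
For reference, a quick toy example with the new class (made-up data):

library(SparseArray)

a <- array(rpois(6000, lambda = 0.1), dim = c(60, 20, 5))
svt <- SparseArray(a)  # returns an SVT_SparseArray
svt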


Hi Koki, sorry for not seeing this before.

will SparseArraySeed in DelayedArray be replaced by this new sparse array format?

Yes it will. This is still a work-in-progress though.

Best,

H.


There are a few places inside my DelayedTensor package that use SparseArraySeed.

Will these stop working in the upcoming BioC 3.18, when SparseArraySeed is replaced by the SparseArray package?


Not in BioC 3.18. At some point (e.g. in 6 months) they will be deprecated, then 6 months later (i.e. in 1 year) they will be defunct, then 6 months later (i.e. in 1.5 years) they'll be removed.


In the DelayedArray code from two years ago, sparse2dense() was called in write_block().

cf. https://github.com/Bioconductor/DelayedArray/blob/1026afd3bb55fbf54509b3675ac36af6767ef70a/R/write_block.R#L34

However, I found that this part has since been removed. Could you tell me the details? Does the current implementation allow sparse arrays to be written to and read from HDF5 files, thereby increasing computation speed?


Sorry for missing that one too.

When writing blocks to an HDF5 file, the write_block() method for HDF5RealizationSink or TENxRealizationSink objects is invoked, not the default write_block() method that you are referencing above:

library(HDF5Array)

selectMethod("write_block", "HDF5RealizationSink")
# Method Definition:
# 
# function (sink, viewport, block) 
# {
#     if (!is.array(block)) 
#         block <- as.array(block)
#     h5write(block, sink@filepath, sink@name, start = start(viewport), 
#         count = width(viewport))
#     sink
# }
...

selectMethod("write_block", "TENxRealizationSink")
# Method Definition:
#
# function (sink, viewport, block) 
# {
#     .check_viewport(viewport, sink)
#     if (!is(block, "SparseArraySeed")) 
#         block <- as(block, "SparseArraySeed")
#     new_data_len1 <- .append_data(sink@filepath, sink@group, 
#         block@nzdata)
#     new_data_len2 <- .append_row_indices(sink@filepath, sink@group, 
#         block@nzindex[, 1L] - 1L)
#     stopifnot(new_data_len2 == new_data_len1)
#     new_data_len3 <- .append_indptr(sink@filepath, sink@group, 
#         block@nzindex[, 2L], ncol(viewport))
#     stopifnot(new_data_len3 == new_data_len1)
#     sink
# }
...

AFAIK these have not changed, i.e. the former still turns the block into a dense array (with as.array()) before writing it to the file, and the latter still turns it into a sparse representation (with as(block, "SparseArraySeed")).

Note that HDF5 itself does not support sparse representations. Sparse representation in HDF5 is achieved at the application level, by organizing the sparse data in a way that is specific to, and controlled by, the application. For example, the 10x Genomics people use a CSC layout that is "understood" by the TENxMatrix/TENxMatrixSeed classes and constructors (for reading), and by the TENxRealizationSink class and constructor (for writing).
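
For example, here is a sketch of writing in that layout with writeTENxMatrix() (toy data; the file path and group name are arbitrary):

library(HDF5Array)
library(Matrix)

m <- rsparsematrix(10000, 500, density = 0.01)
M <- writeTENxMatrix(m, filepath = tempfile(fileext = ".h5"),
                     group = "mygroup")
M  # a TENxMatrix object; the data stays sparse on disk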

Hope this helps,

H.


I don't know the details, but do you mean that using HDF5RealizationSink or TENxRealizationSink makes it possible to write sparse multi-dimensional arrays to HDF5?

I imagine that CSC is a sparse format specific to 2D matrices, and that for sparse multi-dimensional arrays (e.g., 3D or 4D arrays) the Coordinate (COO) format would be the straightforward choice.

Is such a format supported by HDF5RealizationSink or TENxRealizationSink, or do I have to implement it myself?


I don't know the details, but do you mean that using HDF5RealizationSink or TENxRealizationSink makes it possible to write sparse multi-dimensional arrays to HDF5?

HDF5RealizationSink is for writing dense data to HDF5, and TENxRealizationSink is for writing 2D sparse data to HDF5 in the 10x Genomics format (which uses a CSC layout on top of HDF5).

I imagine that CSC is a sparse format specific to 2D matrices, and that for sparse multi-dimensional arrays (e.g., 3D or 4D arrays) the Coordinate (COO) format would be the straightforward choice.

Yes, the CSC layout used by 10x Genomics and others on top of HDF5 is for 2D sparse data only. I don't know of anybody using a COO layout on top of HDF5 for storing multi-dimensional sparse data. It sounds feasible, but I don't have plans to work on something like this. Note that the COO representation is not particularly efficient for 2D data, and becomes even less efficient for 3D or 4D data, unless the data is very sparse (e.g. sparseness < 5%).
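
To illustrate, here is what such a COO layout could look like (hypothetical; the helper name and dataset names below are made up for illustration):

library(rhdf5)

# One 'nzcoords' dataset (nnz x N matrix of 1-based indices), one
# 'nzvals' dataset (length nnz), plus the array dimensions.
write_coo_h5 <- function(nzcoords, nzvals, dims, filepath, group = "sparse") {
    h5createFile(filepath)
    h5createGroup(filepath, group)
    h5write(nzcoords, filepath, paste0(group, "/nzcoords"))
    h5write(nzvals,   filepath, paste0(group, "/nzvals"))
    h5write(dims,     filepath, paste0(group, "/dims"))
}

# Toy 3D example: 4 nonzero values in a 100 x 50 x 10 array.
coords <- rbind(c(1L, 1L, 1L), c(10L, 2L, 3L),
                c(50L, 25L, 5L), c(100L, 50L, 10L))
write_coo_h5(coords, c(1.5, -2, 3.25, 7), c(100L, 50L, 10L),
             tempfile(fileext = ".h5"))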

Is such a format supported by HDF5RealizationSink or TENxRealizationSink, or do I have to implement it myself?

The HDF5Array package does not support writing multi-dimensional sparse data to HDF5, so it would be your responsibility to support that. This implies taking care of creating the COO layout on top of HDF5 and writing the data to it. In the DelayedArray world, this would typically be done by implementing a RealizationSink derivative, e.g. COO_Sparse_HDF5RealizationSink (sorry for the long and ugly name), and a writeHDF5SparseArray() function. You would also probably need the reading counterpart, which, in the DelayedArray world, would typically be done by implementing the COO_H5SparseArraySeed and COO_H5SparseArray classes and constructors. All this work would probably deserve to be in a package of its own.

H.


Thank you, this gives me a very clear understanding of the situation.

COO_Sparse_HDF5RealizationSink(), writeHDF5SparseArray(), COO_H5SparseArraySeed(), and COO_H5SparseArray()

So do you mean it would be good to implement these inside my DelayedTensor package?

Is there anything I should be aware of when working with DelayedArray? For example, I can imagine writing a sparse high-dimensional array to HDF5, but I'm still not sure how to implement the part where it is read back and immediately converted to a sparse array format supported by DelayedArray, especially now that the new SparseArray package has been released.


I can imagine writing a sparse high-dimensional array to HDF5, but I'm still not sure how to implement the part where it is read back and immediately converted to a sparse array format supported by DelayedArray, especially now that the new SparseArray package has been released.

Before you embark on implementing your own COO layout on top of HDF5, I suggest you take a look at the tiledb (CRAN) and TileDBArray (Bioconductor) packages. I believe they support writing/reading sparse high-dimensional arrays.
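
For instance (a sketch with a toy sparse matrix; the writeTileDBArray() usage below is an assumption to verify against the TileDBArray documentation):

library(TileDBArray)
library(Matrix)

x <- rsparsematrix(1000, 1000, density = 0.01)
out <- writeTileDBArray(x, sparse = TRUE)  # realize as a sparse TileDB array
out  # a TileDBArray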
