Are aperm, realize, writeHDF5Array, and ReshapedHDF5Array block-size aware?
Koki • 0
@koki-7888
Japan

I am using DelayedArray and HDF5Array inside my package. Some of my functions call getAutoBlockSize internally so that they never exceed the block size specified with setAutoBlockSize, and then use HDF5Array::write_block as documented, with each block computed sequentially. https://rdrr.io/bioc/DelayedArray/man/write_block.html

Related to this, is it safe to assume that all the functions implemented in DelayedArray and HDF5Array are block-size aware? For example, my functions call the functions listed below without any explicit block-processing code of my own. Can I assume that they respect the block size and process the data block by block? I couldn't find any explicit call to getAutoBlockSize in their source code.

• DelayedArray::aperm
• DelayedArray::realize
• HDF5Array::writeHDF5Array
• HDF5Array::ReshapedHDF5Array

Also, I would like to know whether there is a way to find out if a given function is block-size aware. If such a list exists somewhere, it would be helpful.

Koki

HDF5Array DelayedArray • 172 views
@herve-pages-1542
Seattle, WA, United States

Hi,

Operations on DelayedArray objects are divided into 2 big families: delayed operations and block-processed operations.

Delayed operations don't compute anything, so they typically don't generate any IO and are almost instantaneous. This is something you can feel at the command line, e.g. when you do something like M2 <- log(M + 1). All this does is stack the a -> a + 1 and a -> log(a) operations on top of M and return the object carrying the new stacked operations as M2. One important thing about delayed operations is that they always return another DelayedArray or DelayedMatrix object (so, for example, things like sum(M) or colVars(M) cannot be delayed operations because they return ordinary vectors). You can see the delayed operations carried by an object with showtree(M2). The cost of this stacking is almost nothing and, most importantly, it doesn't depend on the size of M. Delayed operations don't need to know anything about blocks, grids, chunk geometries, data sparsity, or data compression.
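To make the distinction concrete, here is a minimal sketch, assuming a small in-memory matrix wrapped in a DelayedArray (the toy data and variable names are illustrative, not from the original question):

```r
library(DelayedArray)

# Toy data: a small ordinary matrix wrapped as a DelayedMatrix.
M <- DelayedArray(matrix(runif(12), nrow = 3))

# Delayed: nothing is computed here, the two operations are only
# recorded on top of M, and the result is still a DelayedMatrix.
M2 <- log(M + 1)
is(M2, "DelayedMatrix")

# Display the delayed operations carried by M2.
showtree(M2)

# Not delayed: sum() has to return an ordinary number, so the data
# is actually read and processed at this point.
s <- sum(M)
```

The call that creates M2 should return immediately regardless of how large M is, whereas sum(M) has to visit the data.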

Only block-processed operations are block size aware and respect getAutoBlockSize().
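As a rough illustration of what "respecting getAutoBlockSize()" means, the sketch below lowers the block size, builds the default grid with defaultAutoGrid(), and walks it with blockApply(). The tiny block size of 4000 bytes and the toy matrix are assumptions chosen only so that several blocks are produced:

```r
library(DelayedArray)

M <- DelayedArray(matrix(runif(2000), nrow = 50))

# Remember the current setting, then use a deliberately tiny block
# size (4000 bytes, i.e. at most 500 doubles per block).
old_size <- getAutoBlockSize()
setAutoBlockSize(4000)

# The default grid is derived from the current automatic block size.
grid <- defaultAutoGrid(M)
n_blocks <- length(grid)

# Number of array elements covered by each block of the grid.
block_lengths <- vapply(seq_along(grid),
                        function(i) prod(dim(grid[[i]])),
                        numeric(1))

# Block-processed sum: one partial sum per block, combined at the end.
block_sums <- blockApply(M, sum, grid = grid)
total <- sum(unlist(block_sums))

# Restore the previous block size.
setAutoBlockSize(old_size)
```

Here total should match sum(M), and no block should hold more than 500 elements.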

All operations supported by DelayedArray objects should be listed in the man pages ?DelayedArray-utils, ?DelayedArray-stats, ?DelayedMatrix-utils, and ?DelayedMatrix-stats, and each man page states whether an operation is delayed or block-processed. Note that more operations are provided by the DelayedMatrixStats package and, AFAIK, all of those are block-processed.

• DelayedArray::aperm() and HDF5Array::ReshapedHDF5Array() are delayed operations.

• DelayedArray::realize() and HDF5Array::writeHDF5Array() are block-processed.
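A small sketch of the two kinds of operations working together, assuming a toy 3D array (the data and variable names are illustrative): aperm() only records the permutation, and the block-processed write happens when realize() is called.

```r
library(DelayedArray)
library(HDF5Array)

# Toy 3D array wrapped as a DelayedArray.
a <- array(runif(24), dim = c(2, 3, 4))
A <- DelayedArray(a)

# Delayed: only the permutation is recorded, no data is moved.
P <- aperm(A, c(2, 1, 3))
is(P, "DelayedArray")

# Block-processed: the permuted data is computed block by block
# and written to an HDF5 file.
R <- realize(P, "HDF5Array")
```

The content of R should be identical to what base::aperm(a, c(2, 1, 3)) returns.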

Hope this helps,

H.


Very interesting.

So, if I apply a delayed operation, which only stacks the calculation without performing it, and then realize the result afterwards, can I assume that the writes to the file (e.g. HDF5) performed during realization are block-size aware?

For example, the following code combines delayed operations and block-processed operations. Is it correct that the code as a whole does not exceed the block size?

# simple delayed operation + realize
M2 <- realize(log(M + 1), "HDF5Array")

# aperm + realize
M2 <- realize(aperm(M, c(2,1,3)), "HDF5Array")

# ReshapedHDF5Array + realize
tmpfile <- tempfile()
writeHDF5Array(M, filepath=tmpfile, name="tmp", as.sparse=TRUE)
M2 <- realize(ReshapedHDF5Array(tmpfile, "tmp", new_modes), "HDF5Array")


Also, I think that even a simple delayed operation can cause a memory error (e.g., Error: C stack usage of HDF5Array). Does that mean there is not enough memory to stack the calculation?

Koki


> For example, the following code combines delayed operations and block-processed operations. Is it correct that the code as a whole does not exceed the block size?

Yes, that's correct. More precisely: realize(x, "HDF5Array") just calls as(x, "HDF5Array"), which just calls writeHDF5Array(x), so the three are equivalent. The workhorse behind writeHDF5Array(x) is DelayedArray::BLOCK_write_to_sink() (this is an internal helper, so it is not documented). As its name suggests, DelayedArray::BLOCK_write_to_sink() is block-size aware, i.e. it defines a grid of blocks on x that respects getAutoBlockSize(), walks over the blocks of that grid, and realizes each block before writing it to disk.
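The block-by-block write described above can be imitated with the documented read_block()/write_block() API. This is only a sketch of the idea under stated assumptions, not the actual code of the internal helper; the toy data, the tiny block size, and the variable names are illustrative, and the exact return value of write_block() may differ between package versions:

```r
library(DelayedArray)
library(HDF5Array)

# A DelayedMatrix carrying delayed operations (toy data).
x <- log(DelayedArray(matrix(runif(2000), nrow = 50)) + 1)

# Tiny block size so that the grid has several blocks.
old_size <- getAutoBlockSize()
setAutoBlockSize(4000)

# Sink that receives the realized blocks (an HDF5 file on disk).
sink <- HDF5RealizationSink(dim(x), type = type(x))

# Grid of blocks on x that respects the automatic block size.
grid <- defaultAutoGrid(x)

for (bid in seq_along(grid)) {
  viewport <- grid[[bid]]
  # Realize one block in memory: the delayed operations are
  # applied to this block only.
  block <- read_block(x, viewport)
  # Write the realized block to the sink.
  sink <- write_block(sink, viewport, block)
}
close(sink)

# Turn the sink into a DelayedArray backed by the new HDF5 dataset.
res <- as(sink, "DelayedArray")

setAutoBlockSize(old_size)
```

At any point in the loop only one realized block is held in memory, which is the property that makes the write block-size aware.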

Note however that choosing blocks that respect getAutoBlockSize() isn't a guarantee that the code won't use more memory than the block size. This is a common misconception. See the last paragraph of the Details section in ?getAutoBlockSize for more information about this.

> Also, I think that even a simple delayed operation can cause a memory error.

Well, not a simple delayed operation. You need to stack tens of thousands of delayed operations on an object to end up with a "C stack usage is too close to the limit" problem. This typically happens when you apply a delayed operation in a loop, which is almost never a good idea.
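If a loop of delayed operations really cannot be avoided, one possible workaround is to realize the object every so often so that the accumulated operations are flushed. The sketch below is illustrative only: the loop length, the flush interval of 50, and the toy data are assumptions, and a loop this short is nowhere near triggering the error. It also assumes no realization backend has been set, in which case realize() keeps the result in memory; pass "HDF5Array" to write to disk instead.

```r
library(DelayedArray)

x <- DelayedArray(matrix(1, nrow = 4, ncol = 5))

for (i in 1:200) {
  # Delayed operation: only recorded, not computed.
  x <- x + 1
  # Every 50 iterations, compute the pending operations and start
  # again from a realized object that carries none.
  if (i %% 50 == 0) {
    x <- realize(x)
  }
}

final <- as.matrix(x)
```

When the update can be expressed without a loop (here, adding 200 once), that is usually the simpler choice.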

H.


Ok, I got the gist of it.

If I run into the same situation as before (Error: C stack usage of HDF5Array), where I have to apply delayed operations repeatedly, I had better call realize often to avoid the "C stack usage is too close to the limit" error.

Thanks a lot.