Following up on the vignette in the devel version of DelayedArray: Is there a timeline for implementation of multicore processing across blocks?
The timeline is: as soon as possible! I'll do my very best to have this into the next BioC release.
That sounds good! I am considering rewriting an R package I am developing to take advantage of the new features in DelayedArray. While it is much more memory-efficient than my current implementation, it is quite a bit slower. The main bottlenecks are calls to rowSums(), colSums(), etc. on large arrays. I assume simple cases like these would see a fairly significant speed boost, especially on server-grade machines with many cores?
I expect multicore block processing to improve speed, but how much exactly will depend on the type of DelayedArray object. Let's keep in mind that with on-disk DelayedArray objects (HDF5Array), the cores will compete for I/O. And it's not clear to me that N cores trying to read N HDF5 blocks concurrently are going to do a much better job than 1 core reading the N blocks sequentially. Of course it will also depend on your hard drive, with SSDs handling concurrent read access better than rotating drives. For in-memory DelayedArray objects (e.g. RleArray), multicore block processing will probably give a more significant speed boost.
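To give an idea of the intended usage once this lands, a sketch along the lines of what I have in mind is below. The setAutoBPPARAM() name is a placeholder for the knob that would select a BiocParallel backend; the exact entry point may change before release:

```r
library(DelayedArray)
library(BiocParallel)

## Hypothetical sketch: tell the block-processing machinery to use 4 cores.
## The name of this setter is not final.
setAutoBPPARAM(MulticoreParam(workers = 4))

M <- RleArray(Rle(1:5, 1:5), dim = c(5, 3))
colSums(M)  # block-processed; blocks would then be handled in parallel
```

Operations like colSums() themselves would not change; only the backend driving the blocks would.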
This was for read-only multicore block processing. Note that operations that write the data to disk (e.g. realization or matrix multiplication) won't be able to support multicore block processing if the realization backend is HDF5, because HDF5 does not yet support concurrent write access to a dataset. However, we should be able to support multicore realization of a DelayedArray as an RleArray object.
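Concretely, realization goes through realize(). With an HDF5 backend the writing has to stay sequential, but realizing to an in-memory RleArray is the kind of step that could run one block per core. A small sketch, assuming the current realize() interface:

```r
library(DelayedArray)

## A DelayedArray carrying some delayed operations.
M <- RleArray(Rle(0L, 30), dim = c(10, 3)) + 1L

## Realize the delayed result in memory as a new RleArray. This is the
## kind of realization that could be done block-by-block in parallel.
M2 <- realize(M, BACKEND = "RleArray")
```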
I am only using in-memory objects at the moment, primarily a DataFrame of Rle's wrapped in a DelayedArray. I was actually wondering what the difference is between a DelayedArray around a DataFrame and an RleMatrix: conceptually they seem very similar? Is one more efficient speed- or memory-wise? Will both support parallel block processing?
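For reference, here are the two representations I am comparing, at toy sizes (my real data is much larger):

```r
library(DelayedArray)

x <- Rle(c(0L, 1L), c(8, 2))

## (1) What I use now: a DataFrame of Rle's wrapped in a DelayedArray.
df <- DataFrame(s1 = x, s2 = rev(x))
M1 <- DelayedArray(df)

## (2) The same data as an RleMatrix.
M2 <- RleArray(c(x, rev(x)), dim = c(10, 2))
```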
The plan for RleArray and RleMatrix objects is to use a seed (RleArraySeed) that supports chunking. This will allow better compression and more efficient block processing (especially in the case of multicore block processing) than a DataFrame of Rle's wrapped in a DelayedArray.
Furthermore, this chunking will allow multicore realization of a DelayedArray as an RleArray or RleMatrix object, something that is not really feasible with a DataFrame of Rle's wrapped in a DelayedArray.
Interesting! So looking forward, to unlock most of the features of DelayedArray (better compression plus parallel block processing), it's better to use RleMatrix than a DataFrame of Rle's?
One reason for using a DataFrame of Rle's is that it seems much faster to create than an RleMatrix. What's a good way to build a very large RleMatrix? Obviously, first creating a normal matrix is not possible, since that would take up too much memory. Coercing a DataFrame of Rle's to an RleMatrix seems very slow as well.
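For scale, here is the pattern I am using now, and the one dense-free route to an RleMatrix I have found so far (toy sizes; the column contents are made up):

```r
library(DelayedArray)

## Build each column as an Rle first (this part is fast).
cols <- lapply(1:50, function(j) Rle(j %% 3L, 2000L))

## What I do now: a DataFrame of Rle's wrapped in a DelayedArray.
df <- do.call(DataFrame, cols)
M1 <- DelayedArray(df)

## One way to get an RleMatrix without a dense intermediate:
## concatenate the column Rle's and call RleArray() once.
M2 <- RleArray(do.call(c, cols), dim = c(2000L, 50L))
```

Is the concatenate-then-RleArray() route the recommended one, or is there something better?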