After reading the vignette for DelayedMatrixStats, I'm confused about the mention of two types of RleArray seeds: SolidRleArraySeed & ChunkedRleArraySeed
What's the difference between these two seeds and a simple DataFrame of Rle's as a seed? Are there performance/memory differences between them? Are there any pros/cons associated with each of them?
I'm going to assume we're talking about representing 2-dimensional arrays.
It comes down to how the data are stored (i.e. run length encoded):
A DataFrame with Rle columns is run length encoded column-wise with a separate Rle per column.
A SolidRleArraySeed-backed DelayedArray is run length encoded column-wise but with one Rle for the entire dataset (think of it as wrapping around from the bottom of column 1 to the start of column 2, etc.).
A ChunkedRleArraySeed-backed DelayedArray is run length encoded column-wise but in chunks of a fixed size.
I'll include some example code in a follow-up comment that illustrates this (it's too long to include in one post).
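In the meantime, here's a minimal sketch of the first two representations (a DataFrame of Rle's versus a single column-major Rle). It uses the S4Vectors and DelayedArray APIs; the exact seed class you get back from RleArray() may depend on your package version.

```r
suppressPackageStartupMessages({
  library(S4Vectors)     # DataFrame, Rle
  library(DelayedArray)  # DelayedArray, RleArray
})

## The same 4 x 2 matrix, stored two ways.
## Column 1 is c(0, 0, 0, 1); column 2 is c(1, 2, 2, 0).

## (1) A DataFrame with one Rle per column -- note the run of 1's is
## split across the two per-column Rle's:
df <- DataFrame(A = Rle(c(0L, 1L), c(3L, 1L)),
                B = Rle(c(1L, 2L, 0L), c(1L, 2L, 1L)))
m_df <- DelayedArray(df)       # use the DataFrame itself as the seed

## (2) RleArray() encodes the data column-major; with a single Rle the
## run of 1's wraps from the bottom of column 1 to the top of column 2:
x <- Rle(c(0L, 1L, 2L, 0L), c(3L, 2L, 2L, 1L))
m_rle <- RleArray(x, dim = c(4, 2))

## Both hold the same values:
all(as.matrix(m_df) == as.matrix(m_rle))   # TRUE
```

The run of 1's is the interesting bit: the DataFrame representation has to break it into two runs, one per column, while the single-Rle representation keeps it as one run spanning the column boundary.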
The best choice depends on several factors, such as:
The run length patterns of your data
How you want to access the data (e.g. if you only require column-wise access, then data that is chunked per-column is theoretically more efficient)
Whether you want to use the broader DelayedArray framework (in which case, a DataFrame probably isn't the best choice).
Thank you for the detailed explanation. A few follow-up questions:
- Why is it beneficial to store the data as a single Rle in SolidRleArraySeed? Doesn't that limit the total size of the matrix, since it spans all columns?
- If you chunk a ChunkedRleArraySeed by column, how is that different from wrapping a DataFrame of Rle's in a DelayedArray?
Why is it beneficial to store the data as a single Rle in SolidRleArraySeed?
I think it's more the case that this is a simpler object to design and program with (Hervé Pagès will know best).
Doesn't that limit the total size of the matrix, since it spans all columns?
Yes, I think so.
If you chunk a ChunkedRleArraySeed by column, how is that different from wrapping a DataFrame of Rle's in a DelayedArray?
Off the top of my head, there's probably no difference (though there may be weird corner cases). Oh, one difference is the tricky thing I mentioned with a ChunkedRleArraySeed-backed DelayedArray: it may have a slightly smaller memory footprint if all values are integers in [0, 255] (untested).
The exact internal representation of an RleArray object is still a work in progress and subject to change. The two current *RleArraySeed classes reflect the two experiments I've done so far with this. ChunkedRleArraySeed came after SolidRleArraySeed and improves on it by replacing the single Rle with a collection of smaller Rle's (one per chunk) placed in an environment. This makes realization of an arbitrary DelayedArray object x as an RleArray object (with as(x, "RleArray")) more memory efficient. It can also achieve slightly better compression when the array data is of type integer (e.g. count data) by storing the data of a given chunk in a raw-Rle instead of an integer-Rle if all the values in the chunk are >= 0 and <= 255. So ChunkedRleArraySeed is currently the default, but SolidRleArraySeed has been kept around for backward compatibility with old serialized RleArray objects.
Compared to using a seed that is just a DataFrame with Rle columns, a ChunkedRleArraySeed seed will probably not make much difference for read access. However, a DataFrame seed would be inefficient in the context of realization, as it would trigger a copy of the entire data every time a block is written to it (realization is done block by block). ChunkedRleArraySeed was specifically designed to be an efficient realization sink.
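As a small sketch of that realization path (the final seed-class check assumes ChunkedRleArraySeed is still the default, as described above):

```r
suppressPackageStartupMessages(library(DelayedArray))

## An ordinary matrix with a delayed operation stacked on top:
m  <- matrix(rep(c(0L, 1L), each = 50L), nrow = 10)
da <- DelayedArray(m) + 0L      # a DelayedMatrix with a pending op

## Realizing writes the result into the seed block by block; a seed
## that copies all of its data on every write (as a DataFrame of
## Rle's would) makes each block write cost O(size of whole dataset):
r <- as(da, "RleArray")

class(seed(r))                  # the default (chunked) seed class
```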
Finally, the plan for BioC 3.9 is to improve ChunkedRleArraySeed by: (1) allowing arbitrary chunk geometry, (2) letting the Rle within each chunk run either along the columns or along the rows, and (3) generalizing the use of raw-Rle to any chunk that contains 256 unique values or fewer (such a chunk would also contain a small translation table to decode the data). This will provide opportunities to choose the chunk geometry and orientation (i.e. col-major or row-major) that is optimal for a given data set and for the typical access pattern to it.
Doesn't that limit the total size of the matrix, since it spans all columns?
Indeed. I should add that one of the original motivations for coming up with ChunkedRleArraySeed was exactly this limitation. I wanted to be able to load the 1.3 Million Brain Cell Dataset (https://bioconductor.org/packages/release/data/experiment/html/TENxBrainData.html) in an RleArray object but using a SolidRleArraySeed seed wouldn't allow me to do this because it can only represent a data set of length <= 2^31 - 1.
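A quick back-of-the-envelope check of that limit. The dataset's dimensions (about 27,998 genes by 1,306,127 cells) are my assumption based on the TENxBrainData package; the 2^31 - 1 bound is the one stated above:

```r
## A SolidRleArraySeed stores everything in one Rle, so it can only
## represent a data set of length <= 2^31 - 1:
.Machine$integer.max                  # 2147483647, i.e. 2^31 - 1

## The 1.3 Million Brain Cell Dataset needs roughly 17x more values
## (dimensions assumed: 27998 genes x 1306127 cells):
n_values <- 27998 * 1306127           # ~3.66e10
n_values > .Machine$integer.max       # TRUE
```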
Thanks for the detailed response! I'm considering updating the CAGEfightR package from using dgCMatrix to using DelayedArray instead (given the new addition of parallel processing and quick summaries from DelayedMatrixStats, particularly rowsum). Using RleMatrix seems like the obvious choice for a replacement, but I'm hesitant to update if there are many planned changes coming up. Is it premature to start basing packages around RleMatrix?
I might change the interface of the RleArray() constructor function a little though (e.g. replacing the chunksize argument with chunkdim). Other than that, as long as you don't serialize your RleMatrix instances and your code doesn't access the internals of these objects directly, any changes I make to their internals shouldn't affect you, at least in theory.