4 months ago by
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
I'm going to assume we're talking about representing 2-dimensional arrays.
It comes down to how the data are stored (i.e. run length encoded):
- A DataFrame with Rle columns is run length encoded column-wise with a separate Rle per column.
- A SolidRleArraySeed-backed DelayedArray is run length encoded column-wise but with one Rle for the entire dataset (think of it as wrapping around from the bottom of column 1 to the start of column 2, etc.).
- A ChunkedRleArraySeed-backed DelayedArray is run length encoded column-wise but in chunks of a fixed size.
I'll include some example code in a follow-up comment that illustrates this (it's too long to include in one post).
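As a rough, language-neutral picture of the three layouts, here is a minimal Python sketch. It is not how DataFrame or DelayedArray are implemented: the `rle()` helper, the toy matrix, and the chunk size of 4 are all made up for illustration, and it assumes the chunks are cut from the same column-by-column vector that the "solid" layout uses.

```python
from itertools import groupby

def rle(values):
    """Run-length encode a sequence into (value, run_length) pairs."""
    return [(v, len(list(g))) for v, g in groupby(values)]

# A toy 3 x 2 matrix, stored as a list of columns.
columns = [[0, 0, 1],
           [1, 1, 1]]

# 1. One Rle per column (like a DataFrame with Rle columns).
per_column = [rle(col) for col in columns]

# 2. One Rle over all columns laid end to end (the "solid" layout).
flat = [v for col in columns for v in col]
solid = rle(flat)

# 3. One Rle per fixed-size chunk of that same end-to-end vector.
chunk_size = 4  # arbitrary value, for illustration only
chunked = [rle(flat[i:i + chunk_size])
           for i in range(0, len(flat), chunk_size)]

print(per_column)  # [[(0, 2), (1, 1)], [(1, 3)]]
print(solid)       # [(0, 2), (1, 4)]
print(chunked)     # [[(0, 2), (1, 2)], [(1, 2)]]
```

Note how the run of 1s crosses the column boundary in the solid layout but is split at the boundary in the per-column layout, and split at the chunk boundary in the chunked layout.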
The best choice depends on several factors, such as:
- The run length patterns of your data
- How you want to access the data (e.g. if you only require column-wise access, then data that is chunked per-column is theoretically more efficient)
- Whether you want to use the broader DelayedArray framework (in which case, a DataFrame probably isn't the best choice).
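To make the first factor concrete, here is a small Python sketch that counts how many runs a per-column layout and a single wrap-around layout would need for two toy matrices. The helper names and the data are hypothetical, and the number of runs is only a crude proxy for storage cost, not a measurement of the real classes.

```python
from itertools import groupby

def n_runs(values):
    """Number of runs in a sequence (consecutive equal values form one run)."""
    return sum(1 for _ in groupby(values))

def compare_layouts(columns):
    """Return (runs with one Rle per column, runs with one Rle overall)."""
    per_column = sum(n_runs(col) for col in columns)
    solid = n_runs([v for col in columns for v in col])
    return per_column, solid

# Four columns of five rows each.
constant = [[0] * 5 for _ in range(4)]              # every entry identical
alternating = [[0, 1, 0, 1, 0] for _ in range(4)]   # values flip on every row

constant_counts = compare_layouts(constant)
alternating_counts = compare_layouts(alternating)

print(constant_counts)     # (4, 1)
print(alternating_counts)  # (20, 17)
```

When long runs span column boundaries, the single wrap-around Rle needs far fewer runs. When values change on nearly every row, neither layout compresses much and the difference is small.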
Hope this helps,