Question: DelayedArray seeds: SolidRleArraySeed vs ChunkedRleArraySeed vs DataFrame of Rle's
0
23 days ago by
maltethodberg100
UCPH
maltethodberg100 wrote:

After reading the vignette for DelayedMatrixStats, I'm confused about the mention of two types of RleArray seeds: SolidRleArraySeed and ChunkedRleArraySeed.

What's the difference between these two seeds and a simple DataFrame of Rle's as seed? Are there performance/memory differences between them? Are there any pros/cons associated with each of them?

4
22 days ago by
Peter Hickey440
Johns Hopkins University, Baltimore, USA
Peter Hickey440 wrote:

I'm going to assume we're talking about representing 2-dimensional arrays.

It comes down to how the data are stored (i.e. run length encoded):

• A DataFrame with Rle columns is run length encoded column-wise with a separate Rle per column.
• A SolidRleArraySeed-backed DelayedArray is run length encoded column-wise but with one Rle for the entire dataset (think of it as wrapping around from the bottom of column 1 to the start of column 2, etc.).
• A ChunkedRleArraySeed-backed DelayedArray is run length encoded column-wise but in chunks of a fixed size.

I'll include some example code in a follow-up comment that illustrates this (it's too long to include in one post).

The best choice depends on several factors, such as:

1. The run length patterns of your data
2. How you want to access the data (e.g. if you only require column-wise access, then data that is chunked per-column is theoretically more efficient)
3. Whether you want to use the broader DelayedArray framework (in which case, a DataFrame probably isn't the best choice).
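To make the first factor a bit more concrete, here is a rough sketch in base R. It uses `rle()` as a stand-in for `S4Vectors::Rle`, and the data are simulated, so treat the numbers as illustrative only. It counts runs for sorted versus shuffled data, and for one encoding per column versus one encoding for the whole matrix.

```r
# Base-R sketch: rle() stands in for S4Vectors::Rle; the data are made up.
set.seed(1)
nrow <- 200L
ncol <- 5L
counts <- rpois(nrow * ncol, 10)

# Sorted data form long runs, so the encoding is compact.
sorted_runs <- length(rle(sort(counts))$lengths)

# The same values in random order form short runs, so little is gained.
shuffled_runs <- length(rle(counts)$lengths)

# One encoding per column (DataFrame-like) versus one encoding for the
# whole matrix (SolidRleArraySeed-like). A run cannot cross a column
# boundary in the per-column layout, so it never has fewer runs in total.
m <- matrix(sort(counts), nrow = nrow, ncol = ncol)
per_column_runs <- sum(apply(m, 2, function(col) length(rle(col)$lengths)))
whole_matrix_runs <- length(rle(as.vector(m))$lengths)

c(sorted = sorted_runs, shuffled = shuffled_runs,
  per_column = per_column_runs, whole_matrix = whole_matrix_runs)
```

The per-column layout pays a small price in extra runs at column boundaries, but each column can then be decoded without touching the others.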

Hope this helps,

Pete

1
suppressPackageStartupMessages(library(S4Vectors))
suppressPackageStartupMessages(library(DelayedArray))

nrow <- 200L
ncol <- 5L

# Generate some data that can be efficiently run length encoded.
x <- Rle(500L + sort(rpois(nrow * ncol, 10)))
x
#> integer-Rle of length 1000 with 21 runs
#>   Lengths:   1   2   8  27  32  60  98 121 ...  35  17   6   2   5   2   1
#>   Values : 501 502 503 504 505 506 507 508 ... 515 516 517 518 519 520 522

# Construct a DataFrame with the data in 5 columns
# (there's a bit of messing about to do this)
df <- DataFrame(as.list(split(x, rep(seq_len(ncol), each = nrow))))
# The data are stored as one Rle per column.
df@listData
#> $X1
#> integer-Rle of length 200 with 7 runs
#>   Lengths:   1   2   8  27  32  60  70
#>   Values : 501 502 503 504 505 506 507
#>
#> $X2
#> integer-Rle of length 200 with 3 runs
#>   Lengths:  28 121  51
#>   Values : 507 508 509
#>
#> $X3
#> integer-Rle of length 200 with 3 runs
#>   Lengths:  76 119   5
#>   Values : 509 510 511
#>
#> $X4
#> integer-Rle of length 200 with 2 runs
#>   Lengths: 105  95
#>   Values : 511 512
#>
#> $X5
#> integer-Rle of length 200 with 10 runs
#>   Lengths:  12  64  56  35  17   6   2   5   2   1
#>   Values : 512 513 514 515 516 517 518 519 520 522

# Construct a SolidRleArraySeed-backed DelayedArray
sras <- RleArray(x, dim = c(nrow, ncol))
# The data are in a single Rle.
seed(sras)
#> An object of class "SolidRleArraySeed"
#> Slot "rle":
#> integer-Rle of length 1000 with 21 runs
#>   Lengths:   1   2   8  27  32  60  98 121 ...  35  17   6   2   5   2   1
#>   Values : 501 502 503 504 505 506 507 508 ... 515 516 517 518 519 520 522
#>
#> Slot "DIM":
#> [1] 200   5
#>
#> Slot "DIMNAMES":
#> [[1]]
#> NULL
#>
#> [[2]]
#> NULL

# Construct a ChunkedRleArraySeed-backed DelayedArray with 1 chunk/column.
cras <- RleArray(x, dim = c(nrow, ncol), chunksize = nrow)
# The data are stored in an environment
seed(cras)
#> An object of class "ChunkedRleArraySeed"
#> Slot "breakpoints":
#> [1]  200  400  600  800 1000
#>
#> Slot "type":
#> [1] "integer"
#>
#> Slot "chunks":
#> <environment: 0x565279e17b38>
#>
#> Slot "DIM":
#> [1] 200   5
#>
#> Slot "DIMNAMES":
#> [[1]]
#> NULL
#>
#> [[2]]
#> NULL

ls(seed(cras)@chunks)
#> [1] "000001" "000002" "000003" "000004" "000005"

# Let's take a look at the values in the first chunk
# (the chunk name is not a syntactic R name, so it needs backticks)
seed(cras)@chunks$`000001`
#> integer-Rle of length 200 with 7 runs
#>   Lengths:   1   2   8  27  32  60  70
#>   Values : 501 502 503 504 505 506 507
# The first chunk of cras is the same as the first column of df
# NOTE: This brushes over some tricksy stuff RleArray does to further compress
#       data where all values are in [0, 255]; see
#       https://github.com/Bioconductor/DelayedArray/blob/master/R/RleArray-class.R#L316
identical(seed(cras)@chunks$`000001`, df$X1)
#> [1] TRUE

# Construct a ChunkedRleArraySeed-backed DelayedArray with 10 chunks/column.
cras2 <- RleArray(x, dim = c(nrow, ncol), chunksize = nrow / 10)
# Now have 10-times as many chunks.
ls(seed(cras2)@chunks)
#>  [1] "000001" "000002" "000003" "000004" "000005" "000006" "000007"
#>  [8] "000008" "000009" "000010" "000011" "000012" "000013" "000014"
#> [15] "000015" "000016" "000017" "000018" "000019" "000020" "000021"
#> [22] "000022" "000023" "000024" "000025" "000026" "000027" "000028"
#> [29] "000029" "000030" "000031" "000032" "000033" "000034" "000035"
#> [36] "000036" "000037" "000038" "000039" "000040" "000041" "000042"
#> [43] "000043" "000044" "000045" "000046" "000047" "000048" "000049"
#> [50] "000050"


Thank you for the detailed explanation. A few follow-up questions:
- Why is it beneficial to store the data as a single Rle in SolidRleArraySeed? Doesn't that limit the total size of the matrix since it spans all columns?

- If you chunk a ChunkedRleArray seed by column, how is that different from wrapping a DataFrame of Rle's in a DelayedArray?

1

Why is it beneficial to store the data as a single Rle in SolidRleArraySeed?

I think it's more the case that this is a simpler object to design and program with (Hervé Pagès will know best).

Doesn't that limit the total size of the matrix since it spans all columns?

Yes, I think so.

If you chunk a ChunkedRleArray seed by column, how is that different from wrapping a DataFrame of Rle's in a DelayedArray?

Off the top of my head, there's probably no difference (but there may be weird corner cases). Oh, one difference is the tricksy thing I mentioned with a ChunkedRleArraySeed-backed DelayedArray: it may lead to a slightly smaller memory footprint for the ChunkedRleArraySeed-backed DelayedArray if all values are integers in [0, 255] (untested).

1

Hi,

The exact internal representation of an RleArray object is still a work-in-progress and subject to change. The two current *RleArraySeed classes reflect the two experiments I have done so far with this. ChunkedRleArraySeed came after SolidRleArraySeed and improves on it by replacing a single Rle with a collection of smaller Rle's (one per chunk) placed in an environment. This makes realization of an arbitrary DelayedArray object x as an RleArray object (with as(x, "RleArray")) more memory efficient. It can also achieve slightly better compression when the array data is of type integer (e.g. count data) by storing the data of a given chunk in a raw-Rle instead of an integer-Rle if all the values in the chunk are >= 0 and <= 255. So ChunkedRleArraySeed is currently the default, but SolidRleArraySeed has been kept around for backward compatibility with old serialized RleArray objects.
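The raw-Rle saving can be imitated in base R. The following is only an analogy, not the code DelayedArray runs: `rle()` objects stand in for the per-chunk Rle's, and the variable names are made up for the illustration.

```r
# Base-R analogy only: rle() objects stand in for Rle chunks.
set.seed(1)
chunk <- rpois(1e4, 3)          # integer counts, all well inside [0, 255]
stopifnot(is.integer(chunk), all(chunk >= 0L & chunk <= 255L))

int_runs <- rle(chunk)          # run values stored as 4-byte integers

raw_runs <- int_runs
raw_runs$values <- as.raw(int_runs$values)   # run values stored as 1 byte each

# Decoding restores the original integers exactly.
decoded <- as.integer(inverse.rle(raw_runs))
identical(decoded, chunk)

# The raw version is smaller because only the run values shrink;
# the run lengths are still stored as integers.
object.size(raw_runs) < object.size(int_runs)
```

Because the run lengths keep their size, the saving is modest, which matches the "slightly better compression" wording above.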

Compared to using a seed that is just a DataFrame with Rle columns, using a ChunkedRleArraySeed seed will probably not make much difference for read access. However, using the DataFrame would be inefficient in the context of realization as it would trigger a copy of the entire data every time a block is written to it (realization is done block by block). ChunkedRleArraySeed was specifically designed to be an efficient realization sink.
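As a rough illustration of why an environment of chunks suits block-by-block writing, here is a base-R sketch. `make_sink()` and `write_block()` are hypothetical helper names invented for this example, and `rle()` again stands in for an Rle; the real sink in DelayedArray may differ in its details.

```r
# Hypothetical sketch in base R; make_sink() and write_block() are
# illustrative names, not DelayedArray functions.
make_sink <- function() new.env(parent = emptyenv())

write_block <- function(sink, block) {
  # Environments are modified in place, so adding a chunk does not copy
  # the chunks that are already stored.
  key <- sprintf("%06d", length(ls(sink)) + 1L)
  assign(key, rle(as.vector(block)), envir = sink)
  invisible(sink)
}

m <- matrix(rep(c(0L, 0L, 1L, 1L, 1L, 2L), times = 4), nrow = 6)
sink <- make_sink()
for (j in seq_len(ncol(m))) write_block(sink, m[, j])  # one block per column

ls(sink)

# Reading the chunks back in key order rebuilds the matrix.
rebuilt <- unlist(lapply(sort(ls(sink)),
                         function(k) inverse.rle(get(k, envir = sink))))
identical(matrix(rebuilt, nrow = nrow(m)), m)
```

The zero-padded keys keep the chunks in write order when listed with `ls()`, which is the same naming pattern visible in the `@chunks` output earlier in the thread.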

Finally, the plan for BioC 3.9 is to improve ChunkedRleArraySeed by: (1) allowing arbitrary chunk geometry, (2) letting the Rle within each chunk run either along the columns or along the rows, and (3) generalizing the use of raw-Rle to any chunk that contains 256 unique values or fewer (such a chunk would also contain a small translation table to decode the data). This will provide opportunities to choose the chunk geometry and orientation (i.e. col-major or row-major) that is optimal for a given data set and for the typical access pattern to it.
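Points (2) and (3) can be sketched in base R. This is a toy illustration of the ideas under my own assumptions, not the planned implementation, and none of the names below are DelayedArray API.

```r
# Base-R sketch of two of the planned ideas; nothing here is DelayedArray API.

# (2) Run orientation: this matrix is constant within each row, so runs
# along the rows are long while runs down the columns are short.
m <- matrix(rep(c(5L, 7L, 9L), times = 4), nrow = 3)     # 3 x 4
col_major_runs <- length(rle(as.vector(m))$lengths)      # down the columns
row_major_runs <- length(rle(as.vector(t(m)))$lengths)   # along the rows

# (3) Translation table: values outside [0, 255] can still be stored as
# raw codes when the chunk has at most 256 distinct values.
chunk <- c(501L, 501L, 507L, 522L, 522L, 522L)
table_values <- sort(unique(chunk))                  # small decoding table
codes <- as.raw(match(chunk, table_values) - 1L)     # codes 0, 1, 2, ...
decoded <- table_values[as.integer(codes) + 1L]

c(col_major = col_major_runs, row_major = row_major_runs)
identical(decoded, chunk)
```

For this toy matrix the row-major encoding needs 3 runs against 12 for column-major, which is the kind of difference that makes orientation worth choosing per data set.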

H.

1

Doesn't that limit the total size of the matrix since it spans all columns?

Indeed. I should add that one of the original motivations for coming up with ChunkedRleArraySeed was exactly this limitation. I wanted to be able to load the 1.3 Million Brain Cell Dataset (https://bioconductor.org/packages/release/data/experiment/html/TENxBrainData.html) in an RleArray object but using a SolidRleArraySeed seed wouldn't allow me to do this because it can only represent a data set of length <= 2^31 - 1.
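A quick back-of-the-envelope check of that limit in base R. The dimensions used here (27,998 genes by 1,306,127 cells) are an assumption taken from the TENxBrainData package description rather than from this thread, so please verify them against the package.

```r
# Dimensions are an assumption taken from the TENxBrainData description
# (27,998 genes by 1,306,127 cells); check them against the package itself.
n_genes <- 27998
n_cells <- 1306127

max_length <- .Machine$integer.max      # 2^31 - 1
n_elements <- n_genes * n_cells         # doubles, so no integer overflow

n_elements > max_length                 # TRUE means the matrix is too long
n_elements / max_length                 # roughly how many times too long

# Largest number of such columns a single vector of that length could span.
floor(max_length / n_genes)
```

Splitting the data into chunks sidesteps this because each chunk only has to stay under the limit on its own.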

Thanks for the detailed response! I'm considering updating the CAGEfightR package from using dgCMatrix to using DelayedArray instead (given the new addition of parallel processing and quick summaries from DelayedMatrixStats, particularly rowsum). Using RleMatrix seems like the obvious choice for replacement, but I'm hesitant to update if there are many planned changes coming up. Is it premature to start basing packages around RleMatrix?

I might change the interface of the RleArray() constructor function a little bit though (e.g. replacing the chunksize argument with chunkdim). Other than that, as long as you don't serialize your RleMatrix instances and your code doesn't try to directly access the internals of these objects, any change I make to their internals shouldn't affect you, at least in theory.