Hi all,
I am looking for a way to use HDF5Array
and DelayedArray
to operate together two large matrices of different dimensions (A
and B
below) that share equal number of rows and row-chunk geometry.
My immediate solution to this problem was to add zeros to the smaller matrix B
(see B0
below), and then operate them together as a 3d array.
Is there a more efficient solution to this problem? I was thinking about something along the lines of a row-chunk iterator that would allow me to operate the two matrices without having to create the array x
below.
Thanks!
> library(HDF5Array)
>
> col_N <- 1e3
> col_n <- 3
> row_N <- 5e4
>
> A <- matrix(runif(col_N*row_N), ncol = col_N, nrow = row_N)
> B <- matrix(runif(3*row_N,-1,1), ncol = col_n, nrow = row_N)
>
> B0 <- cbind(B,matrix(0, ncol = col_N-col_n, nrow = row_N))
>
> x <- writeHDF5Array(x = array(c(A,B0),dim = c(row_N,col_N,2)),
+ chunkdim = c(5,col_N,2))
>
> x
<50000 x 1000 x 2> array of class HDF5Array and type "double":
,,1
[,1] [,2] [,3] ... [,999] [,1000]
[1,] 0.9366992 0.3885370 0.9432474 . 0.8581199 0.2343847
[2,] 0.9540822 0.5233401 0.9275795 . 0.8206745 0.1228269
... . . . . . .
[49999,] 0.2729992 0.2575988 0.2134806 . 0.1227398 0.9900415
[50000,] 0.9385657 0.9544607 0.6547487 . 0.5587343 0.4466597
,,2
[,1] [,2] [,3] ... [,999] [,1000]
[1,] -0.6489840 0.2446995 0.9654620 . 0 0
[2,] -0.2742762 -0.6363328 0.4642650 . 0 0
... . . . . . .
[49999,] 0.00637822 0.06429913 -0.90916511 . 0 0
[50000,] -0.45653045 0.70055948 -0.71097379 . 0 0
>
> foo <- function(x) {
+ y <- (abs(x[,,1])^0.25 - 1)/0.25
+
+ out <- lapply(seq_len(col_n),function(z){
+ sum(t(y) %*% x[,z,2])
+ })
+
+ unlist(out)
+ }
>
> blockFoo <- function(x) {
+ grid <- defaultAutoGrid(x)
+ block_foo <- blockApply(x, foo, grid = grid)
+ do.call(rbind,block_foo)
+ }
>
> output <- blockFoo(x)
>
> output
[,1] [,2] [,3]
[1,] 42437.943 -21488.400 15951.636
[2,] 4590.788 58686.650 -52564.467
[3,] 35634.653 -25044.865 33538.275
[4,] -14556.621 -7956.171 12180.909
[5,] -12395.838 -16234.261 -3642.189
[6,] 49027.775 -3704.667 31522.995
[7,] 51266.787 3508.757 -25403.900
[8,] 24198.451 50484.589 -14383.886
>
> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS: /home/data/apps/R/R-4.1.0/lib64/R/lib/libRblas.so
LAPACK: /home/data/apps/R/R-4.1.0/lib64/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8
[6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] HDF5Array_1.20.0 rhdf5_2.36.0 DelayedArray_0.18.0 IRanges_2.26.0 S4Vectors_0.30.0 MatrixGenerics_1.4.0
[7] matrixStats_0.59.0 BiocGenerics_0.38.0 Matrix_1.3-4 devtools_2.4.1 usethis_2.0.1
loaded via a namespace (and not attached):
[1] compiler_4.1.0 rhdf5filters_1.4.0 prettyunits_1.1.1 remotes_2.4.0 tools_4.1.0 testthat_3.0.2 pkgbuild_1.2.0
[8] pkgload_1.2.1 memoise_2.0.0 lifecycle_1.0.0 lattice_0.20-44 rlang_0.4.11 cli_2.5.0 fastmap_1.1.0
[15] withr_2.4.2 desc_1.3.0 fs_1.5.0 rprojroot_2.0.2 grid_4.1.0 glue_1.4.2 R6_2.5.0
[22] processx_3.5.2 sessioninfo_1.1.1 callr_3.7.0 purrr_0.3.4 Rhdf5lib_1.14.0 magrittr_2.0.1 ps_1.6.0
[29] ellipsis_0.3.2 cachem_1.0.5 crayon_1.4.1
I don't know the answer to the question, but I'm wondering if this example code is doing the correct thing. My main concern is that your output has 8 rows. However, this is only due to the fact that
defaultAutoGrid(x)
returns a grid with 8 blocks. If you defined a different grid, or the dimensions ofx
were different, you'd get a different number of rows in the output. Is that really what you want?