Hi all,
I am looking for a way to use HDF5Array and DelayedArray to operate together two large matrices of different dimensions (A and B below) that share equal number of rows and row-chunk geometry.
My immediate solution to this problem was to add zeros to the smaller matrix B (see B0 below), and then operate them together as a 3d array.
Is there a more efficient solution to this problem? I was thinking about something along the lines of a row-chunk iterator that would allow me to operate the two matrices without having to create the array x below.
Thanks!
> library(HDF5Array)
> 
> col_N <- 1e3
> col_n <- 3
> row_N <- 5e4
> 
> A <- matrix(runif(col_N*row_N), ncol = col_N, nrow = row_N)
> B <- matrix(runif(3*row_N,-1,1), ncol = col_n, nrow = row_N)
> 
> B0 <- cbind(B,matrix(0, ncol = col_N-col_n, nrow = row_N))
> 
> x <- writeHDF5Array(x = array(c(A,B0),dim = c(row_N,col_N,2)),
+                     chunkdim = c(5,col_N,2))
> 
> x
<50000 x 1000 x 2> array of class HDF5Array and type "double":
,,1
              [,1]      [,2]      [,3] ...    [,999]   [,1000]
    [1,] 0.9366992 0.3885370 0.9432474   . 0.8581199 0.2343847
    [2,] 0.9540822 0.5233401 0.9275795   . 0.8206745 0.1228269
     ...         .         .         .   .         .         .
[49999,] 0.2729992 0.2575988 0.2134806   . 0.1227398 0.9900415
[50000,] 0.9385657 0.9544607 0.6547487   . 0.5587343 0.4466597
,,2
                [,1]        [,2]        [,3] ...  [,999] [,1000]
    [1,]  -0.6489840   0.2446995   0.9654620   .       0       0
    [2,]  -0.2742762  -0.6363328   0.4642650   .       0       0
     ...           .           .           .   .       .       .
[49999,]  0.00637822  0.06429913 -0.90916511   .       0       0
[50000,] -0.45653045  0.70055948 -0.71097379   .       0       0
> 
> foo <- function(x) {
+   y <- (abs(x[,,1])^0.25 - 1)/0.25
+   
+   out <- lapply(seq_len(col_n),function(z){
+     sum(t(y) %*% x[,z,2])
+   })
+   
+   unlist(out)
+ }
> 
> blockFoo <- function(x) {
+   grid <- defaultAutoGrid(x)
+   block_foo <- blockApply(x, foo, grid = grid)
+   do.call(rbind,block_foo)
+ }
> 
> output <- blockFoo(x)
> 
> output
           [,1]       [,2]       [,3]
[1,]  42437.943 -21488.400  15951.636
[2,]   4590.788  58686.650 -52564.467
[3,]  35634.653 -25044.865  33538.275
[4,] -14556.621  -7956.171  12180.909
[5,] -12395.838 -16234.261  -3642.189
[6,]  49027.775  -3704.667  31522.995
[7,]  51266.787   3508.757 -25403.900
[8,]  24198.451  50484.589 -14383.886
> 
> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS:   /home/data/apps/R/R-4.1.0/lib64/R/lib/libRblas.so
LAPACK: /home/data/apps/R/R-4.1.0/lib64/R/lib/libRlapack.so
locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     
other attached packages:
 [1] HDF5Array_1.20.0     rhdf5_2.36.0         DelayedArray_0.18.0  IRanges_2.26.0       S4Vectors_0.30.0     MatrixGenerics_1.4.0
 [7] matrixStats_0.59.0   BiocGenerics_0.38.0  Matrix_1.3-4         devtools_2.4.1       usethis_2.0.1       
loaded via a namespace (and not attached):
 [1] compiler_4.1.0     rhdf5filters_1.4.0 prettyunits_1.1.1  remotes_2.4.0      tools_4.1.0        testthat_3.0.2     pkgbuild_1.2.0    
 [8] pkgload_1.2.1      memoise_2.0.0      lifecycle_1.0.0    lattice_0.20-44    rlang_0.4.11       cli_2.5.0          fastmap_1.1.0     
[15] withr_2.4.2        desc_1.3.0         fs_1.5.0           rprojroot_2.0.2    grid_4.1.0         glue_1.4.2         R6_2.5.0          
[22] processx_3.5.2     sessioninfo_1.1.1  callr_3.7.0        purrr_0.3.4        Rhdf5lib_1.14.0    magrittr_2.0.1     ps_1.6.0          
[29] ellipsis_0.3.2     cachem_1.0.5       crayon_1.4.1
                    
                
                
I don't know the answer to the question, but I'm wondering if this example code is doing the correct thing. My main concern is that your output has 8 rows. However, this is only due to the fact that
defaultAutoGrid(x)returns a grid with 8 blocks. If you defined a different grid, or the dimensions ofxwere different, you'd get a different number of rows in the output. Is that really what you want?