"dim<-" against on-disk arrays
Koki ▴ 10
@koki-7888

I can change the dimensions of a standard array object with the dim() setter, but it seems that objects from the on-disk array packages (e.g. DelayedArray, HDF5Array, and TileDBArray) do not support this.

Does anyone know of a convenient way to do this?

library("DelayedArray")
library("HDF5Array")
library("TileDBArray")

arr <- array(runif(2*3*4), dim=2:4)
darr <- DelayedArray(arr)
hdf5arr <- as(arr, "HDF5Array")
tilearr <- as(arr, "TileDBArray")

# This works
dim(arr) <- c(2, 3*4)

# These fail
dim(darr) <- c(2, 3*4)
dim(hdf5arr) <- c(2, 3*4)
dim(tilearr) <- c(2, 3*4)


sessionInfo()
R Under development (unstable) (2021-03-18 r80099)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] TileDBArray_1.1.3    HDF5Array_1.19.14    rhdf5_2.35.2
 [4] rTensor_1.4.1        DelayedArray_0.17.10 IRanges_2.25.9
 [7] S4Vectors_0.29.15    MatrixGenerics_1.3.1 matrixStats_0.58.0
[10] BiocGenerics_0.37.2  Matrix_1.3-2         testthat_3.0.2
[13] BiocManager_1.30.12

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6         RcppCCTZ_0.2.9     magrittr_2.0.1     bit_4.0.4
 [5] pkgload_1.2.1      nanotime_0.3.2     lattice_0.20-41    R6_2.5.0
 [9] rlang_0.4.10       tools_4.1.0        grid_4.1.0         tiledb_0.9.0
[13] withr_2.4.2        bit64_4.0.5        rprojroot_2.0.2    crayon_1.4.1
[17] Rhdf5lib_1.13.4    rhdf5filters_1.3.4 compiler_4.1.0     desc_1.3.0
[21] zoo_1.8-9
@herve-pages-1542

Hi,

Support for this sort of reshaping was added in 2019 in HDF5Array as a response to the following request (they have an interesting use case). Take a look at ?ReshapedHDF5Array in HDF5Array and see if that works for you.

After implementing ReshapedHDF5ArraySeed() and ReshapedHDF5Array() I realized that this reshaping could actually be implemented as a delayed operation available via the dim() setter, and this has been on my TODO list since then. However I never got a chance to work on it. The only request I got for this feature so far was from John Muschelli, and John was able to use ReshapedHDF5Array() for his use case.

I'll move this closer to the top of my TODO list.

Cheers,

H.


Thanks for pointing me to ReshapedHDF5Array.

I'll try this one for now.

I'm looking forward to the dim() setter.

Best,

Koki


It would be nice if reshaping that increases the number of dimensions could also be supported by dim<-.

mat <- array(runif(2*12), dim=c(2,12))
dmat <- DelayedArray(mat)

# Reshaping of standard array
new_modes <- c(2, 3, 4)
dim(mat) <- new_modes

# Reshaping of DelayedArray
tmpfile <- tempfile()
writeHDF5Array(dmat, filepath=tmpfile, name="tmp", as.sparse=TRUE)
out <- ReshapedHDF5Array(tmpfile, "tmp", new_modes)
# Error in find_dims_to_collapse(reshaped_dim, seed@dim) :
#   Trying to set 3 dimensions on an HDF5 dataset with 2 dimensions.
#   Reshaping doesn't support increasing the number of dimensions at the
#   moment.

I will look into this although I don't promise anything. Even though many of the operations supported by standard arrays can be implemented as delayed operations, some of them cannot.


So, what is a good way to change the dimensions of a DelayedArray for now?

As shown below, I also found that ReshapedHDF5Array fails not only when the number of dimensions increases, but also when it decreases by two or more.

arr <- array(runif(2*3*4*5), dim=c(2,3,4,5))
darr <- DelayedArray(arr)

dim(arr) <- c(6, 20)
# These fail
dim(darr) <- c(6, 20)
tmpfile <- tempfile()
writeHDF5Array(darr, filepath=tmpfile, name="tmp", as.sparse=TRUE)
out <- ReshapedHDF5Array(tmpfile, "tmp", c(6, 20))
# This works (the number of dimensions decreases by only one)
out <- ReshapedHDF5Array(tmpfile, "tmp", c(2, 3, 20))

I tried using an ordinary for loop to assign the data block by block, but it did not work. I think this corresponds to the 1D-style subassignment described in the subassignment section of the DelayedArray documentation, but I find the error message hard to understand. https://www.bioconductor.org/packages/release/bioc/manuals/DelayedArray/man/DelayedArray.pdf

library("HDF5Array")
.sarray <- function(dim){
    dim <- as.integer(dim)
    setAutoRealizationBackend("HDF5Array")
    sink <- AutoRealizationSink(dim, as.sparse=TRUE)
    close(sink)
    as(sink, "DelayedArray")
}

arr <- array(runif(2*3*4*5), dim=c(2,3,4,5))
darr <- DelayedArray(arr)
out <- .sarray(c(6, 20))
block.size <- 3
for(i in seq_len(floor(length(out)/block.size))){
    start <- 1 + block.size * (i - 1)
    end <- min(start + block.size - 1, length(out))
    out[start:end] <- as.vector(darr[start:end])
}
# Error in `[<-`(`*tmp*`, start:end, value = c(0.190800078911707, 0.138293329160661,  :
#   linear subassignment to a DelayedArray object 'x' (i.e. 'x[i] <-
#   value') is only supported when the subscript 'i' is a logical
#   DelayedArray object of the same dimensions as 'x' and 'value' an
#   ordinary vector of length 1)

As mentioned somewhere else, using delayed subassignments in a loop is almost never a good idea. Not only because it might lead to an "Error: C stack usage is too close to the limit", but also because, even if it works, it will probably be a very inefficient solution. It's almost always better to write things directly to disk as you go instead of using a loop to modify the entire content of a DelayedArray via delayed subassignments.

The writing-to-disk-as-you-go solution looks something like this:

a <- array(1:120, dim=c(2, 3, 4, 5))

library(HDF5Array)

A <- writeHDF5Array(a)

new_dim <- as.integer(c(2*3, 4*5))

setAutoRealizationBackend("HDF5Array")
sink <- AutoRealizationSink(new_dim)

## We're going to write the data to 'sink' one column at a time so we define
## a grid on 'sink' where the blocks are the columns:
sink_grid <- colAutoGrid(sink, ncol=1)

## In addition to the grid on 'sink' (the output), we also need a grid on 'A'
## (the input). We're going to walk on the two grids simultaneously, read a
## block from 'A' and write it to 'sink'. So we need to make sure that we
## define grids that are "aligned", that is, they must define the same number
## of blocks and the blocks in the input must correspond to the blocks in the
## output. This is achieved by defining the following grid for 'A':
A_grid <- RegularArrayGrid(dim(A), spacings=c(2, 3, 1, 1)) 

## Walk on the two grids simultaneously:
for (bid in seq_along(sink_grid)) {
    ## Read block from 'A'.
    A_viewport <- A_grid[[bid]]
    block <- read_block(A, A_viewport)
    ## Reshape it.
    dim(block) <- c(length(block), 1)
    ## Write the reshaped block to 'sink'.
    sink_viewport <- sink_grid[[bid]]
    sink <- write_block(sink, sink_viewport, block)
}

close(sink)
B <- as(sink, "DelayedArray")

Then:

> B
<6 x 20> matrix of class HDF5Matrix and type "double":
      [,1]  [,2]  [,3] ... [,19] [,20]
[1,]     1     7    13   .   109   115
[2,]     2     8    14   .   110   116
[3,]     3     9    15   .   111   117
[4,]     4    10    16   .   112   118
[5,]     5    11    17   .   113   119
[6,]     6    12    18   .   114   120

Note that this approach of walking on two grids simultaneously (one for the input, one for the output) is described in EXAMPLE 2 of the ?write_block man page.
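
For readers who want to convince themselves that the two grids are aligned, here is a minimal base R sketch (no on-disk packages involved) that mimics the walk above in memory. It assumes that the blocks of the input grid are visited with the earliest dimension varying fastest, which is consistent with the output shown above. The name 'rebuilt' and the explicit loops over 'k' and 'l' are only illustrative and not part of any package API.

```r
# Base R only: check that the block-to-column mapping matches the dim() setter.
a <- array(1:120, dim = c(2, 3, 4, 5))

# Reference result: reshape in memory with the ordinary dim() setter.
m <- a
dim(m) <- c(2 * 3, 4 * 5)

# A grid with spacings c(2, 3, 1, 1) cuts 'a' into the slices a[, , k, l].
# Visiting them with k varying fastest (assumed block order) should give
# the columns of 'm' in order.
rebuilt <- matrix(NA_integer_, nrow = 6, ncol = 20)
bid <- 0L
for (l in seq_len(dim(a)[4])) {
    for (k in seq_len(dim(a)[3])) {
        bid <- bid + 1L
        # Flatten the 2 x 3 slice into one column of length 6.
        rebuilt[, bid] <- as.vector(a[, , k, l])
    }
}

stopifnot(identical(rebuilt, m))
```

The same reasoning suggests how to pick the input spacings for other reshapes: each input block should cover exactly the elements that end up in one output block, in the same column-major order.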

Hope this helps,

H.
