Generating a DelayedArray with a size greater than .Machine$integer.max

Koki (@koki-7888)

When the DelayedArray is extremely large, ReshapedHDF5Array does not work, as shown below. I think the reason is that R itself cannot handle integers greater than 2^31-1. How about making ReshapedHDF5Array accept bit64's integer64? cf. https://stackoverflow.com/questions/14589354/struggling-with-integers-maximum-integer-size

    library("HDF5Array")
    library("DelayedRandomArray")
    library("bit64")

    # This does work (2^31-1 > 1000*1000*1000)
    dim1 <- c(1000,1000,1000)
    l1 <- as.integer64(prod(dim1))
    darr1 <- RandomBinomArray(dim=dim1, size=1, prob=0.2)
    tmpfile1 <- paste0(tempfile(), ".h5")
    writeHDF5Array(darr1, tmpfile1, "tmp")
    out1 <- ReshapedHDF5Array(tmpfile1, "tmp", l1)

    # This does not work (2^31-1 < 2000*2000*1000)
    dim2 <- c(2000,2000,1000)
    l2 <- as.integer64(prod(dim2))
    darr2 <- RandomBinomArray(dim=dim2, size=1, prob=0.2)
    tmpfile2 <- paste0(tempfile(), ".h5")
    writeHDF5Array(darr2, tmpfile2, "tmp")
    out2 <- ReshapedHDF5Array(tmpfile2, "tmp", l2)
    # Error in DelayedArray:::normarg_dim(dim) :
    #   'dim' cannot contain negative or NA values
    # In addition: Warning message:
    # In as.integer.integer64(dim) : NAs produced by integer overflow

    sessionInfo()
    # R version 4.1.0 (2021-05-18)
    # Platform: x86_64-pc-linux-gnu (64-bit)
    # Running under: Ubuntu 20.04.2 LTS
    #
    # Matrix products: default
    # BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
    #
    # locale:
    #  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
    #  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
    #  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C
    #  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
    #  [9] LC_ADDRESS=C               LC_TELEPHONE=C
    # [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
    #
    # attached base packages:
    # [1] parallel  stats4    stats     graphics  grDevices utils     datasets
    # [8] methods   base
    #
    # other attached packages:
    #  [1] bit64_4.0.5              bit_4.0.4                DelayedRandomArray_1.1.0
    #  [4] HDF5Array_1.21.0         rhdf5_2.37.0             DelayedArray_0.19.0
    #  [7] IRanges_2.27.0           S4Vectors_0.31.0         MatrixGenerics_1.5.0
    # [10] matrixStats_0.58.0       BiocGenerics_0.39.0      Matrix_1.3-3
    #
    # loaded via a namespace (and not attached):
    # [1] Rcpp_1.0.6         lattice_0.20-44    rhdf5filters_1.5.0 grid_4.1.0
    # [5] dqrng_0.3.0        Rhdf5lib_1.15.0    tools_4.1.0        compiler_4.1.0

Tags: Koki, bit64, HDF5Array, DelayedArray

Comment:

I think this is not a trivial problem, but one that is bound to occur at some point when handling huge array data with DelayedArray. For example, the commonly used 1.3 million mouse brain cell dataset from 10X Chromium has 27998 genes × 1306127 cells ≈ 3.66 × 10^10 elements, and I think something will go wrong when we write data of that size into a DelayedArray using block processing. https://bioconductor.org/packages/release/data/experiment/html/TENxBrainData.html

In the first place, why does DelayedArray require the user to specify an integer? I have written a lot of code like "1L" or "as.integer", but I think it would be better to accept numeric input and convert it to integer inside DelayedArray. We would have less trouble that way. I also hope that values greater than 2^31-1 could be converted to integer64.

Reply:

Let's distinguish between an array of length > .Machine$integer.max where all the dimensions are <= .Machine$integer.max (i.e. prod(dim(a)) > .Machine$integer.max && all(dim(a) <= .Machine$integer.max)) and an array where some of the dimensions are > .Machine$integer.max (i.e. any(dim(a) > .Machine$integer.max)). The DelayedArray framework supports the former but not the latter. The TENxBrainData dataset falls in the first category, so it should be fully supported.

Answer from @herve-pages-1542:

Sorry for the late answer.
The DelayedArray framework does not support datasets with dimensions greater than 2^31-1 (.Machine$integer.max) and there's no plan to change this, at least not in the near future. The problem is that supporting such datasets requires a lot more work than simply replacing the dim vector with as.integer64(dim). It requires going deep into many parts of the code, all the way to the C level, not only in DelayedArray but also in HDF5Array and their dependencies (rhdf5, S4Vectors, IRanges, etc.), and making a lot of changes. It's a much bigger endeavor than most people might think.
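
To make the distinction from the reply above concrete, here is a minimal base-R sketch that sorts a dim vector into the supported and unsupported cases. The helper `classify_dim()` is hypothetical and written only for illustration; it is not part of DelayedArray or HDF5Array, and it assumes the dimensions are passed as ordinary numerics (doubles), which represent whole numbers exactly up to 2^53.

```r
int_max <- .Machine$integer.max   # 2^31 - 1 = 2147483647

# Hypothetical helper (not a DelayedArray function): classify a dim vector
# according to the two cases described in the thread.
classify_dim <- function(dim) {
    dim <- as.numeric(dim)
    if (any(dim > int_max)) {
        "unsupported"   # at least one dimension exceeds .Machine$integer.max
    } else if (prod(dim) > int_max) {
        "long"          # length exceeds the limit, but every dimension fits
    } else {
        "ordinary"      # length and dimensions all fit in an integer
    }
}

classify_dim(c(1000, 1000, 1000))        # "ordinary"    (1e9 elements)
classify_dim(c(27998, 1306127))          # "long"        (TENxBrainData shape)
classify_dim(prod(c(2000, 2000, 1000)))  # "unsupported" (a single dimension of 4e9)

# The underlying R limitation: 4e9 cannot be stored as a 32-bit integer.
is.na(suppressWarnings(as.integer(4e9)))  # TRUE
```

The third call mirrors the failing example in the question, where the whole 2000 x 2000 x 1000 array is collapsed into one dimension of length 4e9.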

In DelayedArray 0.19.1, I've improved the error message you get when calling ReshapedHDF5Array() with a dim vector that contains values > .Machine$integer.max. Admittedly, the old message was not very clear and the extra warning added some confusion. Now you'll get:

    Error in DelayedArray:::normarg_dim(dim) :
      'dim' cannot contain values greater than '.Machine$integer.max' (= 2^31-1 = 2147483647)
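
For readers who want to catch this case before calling into the package, a check along these lines could be run in user code. This is only a sketch: `normalize_dim()` is a hypothetical helper, not the actual implementation of DelayedArray's internal normarg_dim(), and its error messages are illustrative. It also shows the idea raised in the question of accepting plain numerics and converting them to integer in one place.

```r
# Hypothetical user-side helper (NOT DelayedArray's normarg_dim()):
# validate a numeric dim vector and convert it to integer.
normalize_dim <- function(dim) {
    if (!is.numeric(dim) || anyNA(dim) || any(dim < 0) || any(dim != trunc(dim)))
        stop("'dim' must contain non-negative whole numbers")
    if (any(dim > .Machine$integer.max))
        stop("'dim' cannot contain values greater than .Machine$integer.max")
    as.integer(dim)
}

normalize_dim(c(2000, 2000, 1000))   # integer vector: 2000 2000 1000

# A single dimension of 4e9 is rejected with an explicit error
# instead of silently overflowing to NA.
res <- try(normalize_dim(prod(c(2000, 2000, 1000))), silent = TRUE)
inherits(res, "try-error")           # TRUE
```

Checking the range before calling as.integer() avoids the confusing combination of an NA value plus an overflow warning that appears in the original report.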


H.