Generating a DelayedArray with a size greater than .Machine$integer.max
Koki • 0
When the DelayedArray is extremely large, ReshapedHDF5Array does not work, as shown below.

I think the reason is that R's native integer type is 32-bit, so values greater than 2^31-1 (.Machine$integer.max) cannot be represented as integers.
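To make the limit concrete, here is a minimal base-R sketch (no HDF5Array or bit64 involved) of what I believe happens to the two lengths used in this post. The variable names are only for illustration.

```r
# Base R integers are 32-bit signed, so the largest value is 2^31 - 1.
int_max <- .Machine$integer.max

# prod() returns a double, so the products themselves do not overflow.
n_small <- prod(c(1000, 1000, 1000))   # 1e9
n_large <- prod(c(2000, 2000, 1000))   # 4e9

fits_small <- n_small <= int_max       # TRUE: 1e9 fits in an integer
fits_large <- n_large <= int_max       # FALSE: 4e9 does not

# Coercing an out-of-range value to integer yields NA (with a warning),
# which is consistent with the 'dim' error reported in this post.
coerced <- suppressWarnings(as.integer(n_large))
is.na(coerced)
```

So the failure seems to depend only on whether a requested dimension exceeds .Machine$integer.max, not on the amount of data as such.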

How about making ReshapedHDF5Array accept bit64's integer64? cf.


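library(HDF5Array)           # writeHDF5Array(), ReshapedHDF5Array()
library(DelayedRandomArray)  # RandomBinomArray()
library(bit64)               # as.integer64()

# This works: 1000*1000*1000 = 1e9 < 2^31-1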
dim1 <- c(1000,1000,1000)
l1 <- as.integer64(prod(dim1))

darr1 <- RandomBinomArray(dim=dim1, size=1, prob=0.2)
tmpfile1 <- paste0(tempfile(), ".h5")
writeHDF5Array(darr1, tmpfile1, "tmp")
out1 <- ReshapedHDF5Array(tmpfile1, "tmp", l1)

# This does not work: 2000*2000*1000 = 4e9 > 2^31-1
dim2 <- c(2000,2000,1000)
l2 <- as.integer64(prod(dim2))

darr2 <- RandomBinomArray(dim=dim2, size=1, prob=0.2)
tmpfile2 <- paste0(tempfile(), ".h5")
writeHDF5Array(darr2, tmpfile2, "tmp")
out2 <- ReshapedHDF5Array(tmpfile2, "tmp", l2)
# Error in DelayedArray:::normarg_dim(dim) :
#   'dim' cannot contain negative or NA values
# In addition: Warning message:
# In as.integer.integer64(dim) : NAs produced by integer overflow

sessionInfo()
# R version 4.1.0 (2021-05-18)
# Platform: x86_64-pc-linux-gnu (64-bit)
# Running under: Ubuntu 20.04.2 LTS
# Matrix products: default
# BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/
# locale:
# [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
# [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
# [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
# [9] LC_ADDRESS=C               LC_TELEPHONE=C
# attached base packages:
# [1] parallel  stats4    stats     graphics  grDevices utils     datasets
# [8] methods   base
# other attached packages:
# [1] bit64_4.0.5              bit_4.0.4                DelayedRandomArray_1.1.0
# [4] HDF5Array_1.21.0         rhdf5_2.37.0             DelayedArray_0.19.0
# [7] IRanges_2.27.0           S4Vectors_0.31.0         MatrixGenerics_1.5.0
# [10] matrixStats_0.58.0       BiocGenerics_0.39.0      Matrix_1.3-3
# loaded via a namespace (and not attached):
# [1] Rcpp_1.0.6         lattice_0.20-44    rhdf5filters_1.5.0 grid_4.1.0
# [5] dqrng_0.3.0        Rhdf5lib_1.15.0    tools_4.1.0        compiler_4.1.0


I think this is not a minor problem, but one that is bound to occur at some point when handling huge array data with DelayedArray. For example, the commonly used 10X Chromium 1.3 million mouse brain cell dataset has 27998 genes × 1306127 cells, which is about 3.7 × 10^10 elements, and I expect something to go wrong when data of that size is written into a DelayedArray using block processing.
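The arithmetic for that dataset can be checked in base R. This is only a sketch of the size calculation, and the variable names are mine. Note that each dimension on its own is below the limit. It is the total number of elements (for example, when the data is reshaped to one dimension as in my original post) that exceeds it.

```r
n_genes <- 27998
n_cells <- 1306127

# Both operands are doubles, so the product is computed without overflow.
n_elements <- n_genes * n_cells

# Each dimension fits in a 32-bit integer, but the total length does not.
dims_fit  <- all(c(n_genes, n_cells) <= .Machine$integer.max)  # TRUE
total_fit <- n_elements <= .Machine$integer.max                # FALSE

# How many times larger than the integer limit the total is (about 17).
ratio <- n_elements / .Machine$integer.max
```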

In the first place, why does DelayedArray require the user to specify an integer? I have written a lot of code like "1L" or "as.integer", but I think it would be better to accept numeric input and convert it to integer inside DelayedArray. I think that would cause less trouble. I also hope that values greater than 2^31-1 could be converted to integer64.
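A rough sketch of the kind of normalization I have in mind is below. normalize_dim() is a hypothetical helper I made up for this post, not an existing DelayedArray function, and for simplicity it keeps oversized values as double (exact up to 2^53) instead of converting them to integer64.

```r
# Hypothetical helper (NOT part of DelayedArray): accept plain numeric
# input and only coerce to integer when every value fits.
normalize_dim <- function(dim) {
    ok <- is.numeric(dim) && !anyNA(dim) && all(is.finite(dim)) &&
        all(dim >= 0) && all(dim == trunc(dim))
    if (!ok)
        stop("'dim' must contain non-negative whole numbers")
    if (all(dim <= .Machine$integer.max))
        return(as.integer(dim))
    # Too large for a 32-bit integer: keep the values as double here.
    # A real implementation might switch to bit64::integer64 instead.
    as.numeric(dim)
}

small <- normalize_dim(c(1000, 1000, 1000))  # integer vector, no "L" needed
large <- normalize_dim(4e9)                  # stays double instead of NA
```

With something like this, users could pass 1000 instead of 1000L, and a length such as 4e9 would be carried through instead of silently turning into NA.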


