When the DelayedArray is extremely large, ReshapedHDF5Array does not work, as shown below.
I think the reason is that R cannot represent integers greater than 2^31-1.
How about making ReshapedHDF5Array accept bit64's integer64?
cf. https://stackoverflow.com/questions/14589354/struggling-with-integers-maximum-integer-size
library("HDF5Array")
library("DelayedRandomArray")
library("bit64")
# This does work (2^31-1 > 1000*1000*1000)
dim1 <- c(1000,1000,1000)
l1 <- as.integer64(prod(dim1))
darr1 <- RandomBinomArray(dim=dim1, size=1, prob=0.2)
tmpfile1 <- paste0(tempfile(), ".h5")
writeHDF5Array(darr1, tmpfile1, "tmp")
out1 <- ReshapedHDF5Array(tmpfile1, "tmp", l1)
# This does not work (2^31-1 < 2000*2000*1000)
dim2 <- c(2000,2000,1000)
l2 <- as.integer64(prod(dim2))
darr2 <- RandomBinomArray(dim=dim2, size=1, prob=0.2)
tmpfile2 <- paste0(tempfile(), ".h5")
writeHDF5Array(darr2, tmpfile2, "tmp")
out2 <- ReshapedHDF5Array(tmpfile2, "tmp", l2)
# Error in DelayedArray:::normarg_dim(dim) :
# 'dim' cannot contain negative or NA values
# In addition: Warning message:
# In as.integer.integer64(dim) : NAs produced by integer overflow
sessionInfo()
# R version 4.1.0 (2021-05-18)
# Platform: x86_64-pc-linux-gnu (64-bit)
# Running under: Ubuntu 20.04.2 LTS
#
# Matrix products: default
# BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
#
# locale:
# [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
# [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
# [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C
# [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
# [9] LC_ADDRESS=C LC_TELEPHONE=C
# [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#
# attached base packages:
# [1] parallel stats4 stats graphics grDevices utils datasets
# [8] methods base
#
# other attached packages:
# [1] bit64_4.0.5 bit_4.0.4 DelayedRandomArray_1.1.0
# [4] HDF5Array_1.21.0 rhdf5_2.37.0 DelayedArray_0.19.0
# [7] IRanges_2.27.0 S4Vectors_0.31.0 MatrixGenerics_1.5.0
# [10] matrixStats_0.58.0 BiocGenerics_0.39.0 Matrix_1.3-3
#
# loaded via a namespace (and not attached):
# [1] Rcpp_1.0.6 lattice_0.20-44 rhdf5filters_1.5.0 grid_4.1.0
# [5] dqrng_0.3.0 Rhdf5lib_1.15.0 tools_4.1.0 compiler_4.1.0
Koki
I think this is not a trivial problem, but one that is bound to occur at some point when handling huge array data with DelayedArray. For example, the commonly used 1.3 million mouse brain cell dataset from 10x Chromium has 27998 genes × 1306127 cells = approximately 3.66 * 1E+10 elements, and I think something will go wrong when we write data of that size into a DelayedArray using block processing. https://bioconductor.org/packages/release/data/experiment/html/TENxBrainData.html
In the first place, why does DelayedArray require the user to supply integers? I have written a lot of code like "1L" or "as.integer", but I think it would be better to accept numeric input and convert it to integer inside DelayedArray. I think we would have less trouble that way. I also hope that values greater than 2^31-1 could be converted to integer64.
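To make the suggestion concrete, here is a minimal sketch of the kind of coercion I have in mind. The helper name normalize_dim is hypothetical and is not part of DelayedArray or HDF5Array; it uses base R only and simply rejects values above .Machine$integer.max rather than switching to integer64.

```r
# Hypothetical helper (an assumption for illustration, not DelayedArray's API):
# accept a numeric 'dim' such as c(1000, 1000, 1000) and return an integer vector.
normalize_dim <- function(dim) {
  if (!is.numeric(dim) || anyNA(dim))
    stop("'dim' must be a numeric vector without NAs")
  if (any(dim < 0) || any(dim != trunc(dim)))
    stop("'dim' must contain non-negative whole numbers")
  # Check the range before coercing, so the user gets an explicit error
  # instead of the NA produced by as.integer() on overflow.
  if (any(dim > .Machine$integer.max))
    stop("'dim' contains a value greater than .Machine$integer.max")
  as.integer(dim)
}

normalize_dim(c(1000, 1000, 1000))   # integer vector: 1000 1000 1000
```

With a check like this, the user would not need to write "1L" or "as.integer" by hand, and an oversized value would fail with a clear message.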
Let's distinguish between an array of length > .Machine$integer.max where all the dimensions are <= .Machine$integer.max (i.e. prod(dim(a)) > .Machine$integer.max && all(dim(a) <= .Machine$integer.max)) and an array where some of the dimensions are > .Machine$integer.max (i.e. any(dim(a) > .Machine$integer.max)). The DelayedArray framework supports the former but not the latter. The TENxBrainData dataset falls in the first category so should be fully supported.
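The two cases can be checked with base R alone. The sketch below uses a hypothetical helper, classify_dim, which is not part of DelayedArray; it works on doubles so that the product does not overflow. Note that the reshape in the original report asks for a single dimension of 2000*2000*1000 = 4e9, which would land in the second case.

```r
# Hypothetical helper (for illustration only): classify a vector of dimensions
# according to the two cases described above.
classify_dim <- function(dim) {
  dim <- as.numeric(dim)            # doubles hold whole numbers exactly up to 2^53
  int_max <- .Machine$integer.max   # 2^31 - 1
  if (any(dim > int_max)) {
    "dimension exceeds integer.max"
  } else if (prod(dim) > int_max) {
    "length exceeds integer.max"
  } else {
    "fits in integer"
  }
}

classify_dim(c(1000, 1000, 1000))   # "fits in integer"
classify_dim(c(2000, 2000, 1000))   # "length exceeds integer.max"
classify_dim(c(27998, 1306127))     # "length exceeds integer.max" (TENxBrainData)
classify_dim(prod(c(2000, 2000, 1000)))  # "dimension exceeds integer.max"
```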