Question

Generating a DelayedArray with a size greater than .Machine$integer.max

Koki ▴ 10

When the DelayedArray is extremely large, ReshapedHDF5Array does not work, as shown below.

I think the reason is that R's native integer type cannot represent values greater than 2^31-1.

How about making ReshapedHDF5Array accept bit64's integer64? cf. https://stackoverflow.com/questions/14589354/struggling-with-integers-maximum-integer-size
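The limit itself can be reproduced in base R without any of the packages involved. This is a minimal sketch; the variable names `n_ok` and `n_big` are illustrative only, and the element counts are the same ones used in the two cases below.

```r
# Largest value R's native integer type can hold: 2^31 - 1
int_max <- .Machine$integer.max

# prod() returns a double, so these counts are computed without overflow
n_ok  <- prod(c(1000, 1000, 1000))   # 1e9 elements
n_big <- prod(c(2000, 2000, 1000))   # 4e9 elements

n_ok  <= int_max    # TRUE: can be coerced to integer
n_big <= int_max    # FALSE: out of integer range

as.integer(n_ok)                       # a valid integer
suppressWarnings(as.integer(n_big))    # NA, the coercion overflows
```

The second coercion is the same kind of overflow that the warning in the output below reports for `as.integer.integer64()`.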

library("HDF5Array")
library("DelayedRandomArray")
library("bit64")

# This does work (2^31-1 > 1000*1000*1000)
dim1 <- c(1000,1000,1000)
l1 <- as.integer64(prod(dim1))

darr1 <- RandomBinomArray(dim=dim1, size=1, prob=0.2)
tmpfile1 <- paste0(tempfile(), ".h5")
writeHDF5Array(darr1, tmpfile1, "tmp")
out1 <- ReshapedHDF5Array(tmpfile1, "tmp", l1)

# This does not work (2^31-1 < 2000*2000*1000)
dim2 <- c(2000,2000,1000)
l2 <- as.integer64(prod(dim2))

darr2 <- RandomBinomArray(dim=dim2, size=1, prob=0.2)
tmpfile2 <- paste0(tempfile(), ".h5")
writeHDF5Array(darr2, tmpfile2, "tmp")
out2 <- ReshapedHDF5Array(tmpfile2, "tmp", l2)
# Error in DelayedArray:::normarg_dim(dim) :
#   'dim' cannot contain negative or NA values
# In addition: Warning message:
# In as.integer.integer64(dim) : NAs produced by integer overflow

sessionInfo()
# R version 4.1.0 (2021-05-18)
# Platform: x86_64-pc-linux-gnu (64-bit)
# Running under: Ubuntu 20.04.2 LTS
# 
# Matrix products: default
# BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
#
# locale:
# [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
# [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
# [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C
# [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
# [9] LC_ADDRESS=C               LC_TELEPHONE=C
# [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#
# attached base packages:
# [1] parallel  stats4    stats     graphics  grDevices utils     datasets
# [8] methods   base
#
# other attached packages:
# [1] bit64_4.0.5              bit_4.0.4                DelayedRandomArray_1.1.0
# [4] HDF5Array_1.21.0         rhdf5_2.37.0             DelayedArray_0.19.0
# [7] IRanges_2.27.0           S4Vectors_0.31.0         MatrixGenerics_1.5.0
# [10] matrixStats_0.58.0       BiocGenerics_0.39.0      Matrix_1.3-3
#
# loaded via a namespace (and not attached):
# [1] Rcpp_1.0.6         lattice_0.20-44    rhdf5filters_1.5.0 grid_4.1.0
# [5] dqrng_0.3.0        Rhdf5lib_1.15.0    tools_4.1.0        compiler_4.1.0

Koki

bit64 HDF5Array DelayedArray • 1.8k views

ADD COMMENT • link updated 4.7 years ago by Hervé Pagès 16k • written 4.7 years ago by Koki ▴ 10

I don't think this is a minor problem; it is bound to come up sooner or later when handling very large arrays with DelayedArray. For example, the commonly used 10X Chromium 1.3 million cell mouse brain dataset has 27998 genes × 1306127 cells, or about 3.66 × 10^10 elements, and I expect something to go wrong when data of that size is written into a DelayedArray using block processing. https://bioconductor.org/packages/release/data/experiment/html/TENxBrainData.html

In the first place, why does DelayedArray require the user to specify an integer? I have written a lot of code like "1L" or "as.integer", but I think it would be better to accept numeric input and convert it to integer inside DelayedArray. I think we would have less trouble that way. I also hope that values greater than 2^31-1 could be converted to integer64.
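The arithmetic behind that figure can be checked in base R. This is a small sketch using only the two dimensions quoted above; the variable names are illustrative.

```r
# Dimensions quoted above for the 1.3 million cell mouse brain dataset
n_genes <- 27998
n_cells <- 1306127

# Total element count, computed as a double
n_elements <- n_genes * n_cells
format(n_elements, big.mark = ",")

# Each dimension fits in a native integer, but the total length does not
n_genes    <= .Machine$integer.max   # TRUE
n_cells    <= .Machine$integer.max   # TRUE
n_elements <= .Machine$integer.max   # FALSE

# The count is still far below 2^53, so a double represents it exactly
n_elements < 2^53                    # TRUE
```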
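The kind of conversion being asked for could look roughly like the sketch below. `as_dim_integer()` is a hypothetical helper written for illustration and is not part of DelayedArray or HDF5Array. It accepts plain numeric input and coerces it to integer, but it rejects values above 2^31-1 instead of promoting them to integer64.

```r
# Hypothetical helper, NOT a DelayedArray function: accept numeric
# dimensions and return them as native integers when that is safe.
as_dim_integer <- function(dim) {
  if (!is.numeric(dim) || anyNA(dim))
    stop("'dim' must be a numeric vector with no NAs")
  if (any(dim < 0) || any(dim != floor(dim)))
    stop("'dim' must contain non-negative whole numbers")
  if (any(dim > .Machine$integer.max))
    stop("'dim' cannot contain values greater than .Machine$integer.max")
  as.integer(dim)
}

# Whole-number doubles are accepted, so no "1L" or as.integer() is needed
as_dim_integer(c(1000, 1000, 1000))

# A value above 2^31 - 1 is rejected with an explicit error
res <- tryCatch(as_dim_integer(4e9), error = function(e) conditionMessage(e))
res
```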

ADD REPLY • link 4.7 years ago Koki ▴ 10

Let's distinguish between an array of length > .Machine$integer.max where all the dimensions are <= .Machine$integer.max (i.e. prod(dim(a)) > .Machine$integer.max && all(dim(a) <= .Machine$integer.max)) and an array where some of the dimensions are > .Machine$integer.max (i.e. any(dim(a) > .Machine$integer.max)). The DelayedArray framework supports the former but not the latter. The TENxBrainData dataset falls in the first category so should be fully supported.
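The two conditions above can be sketched as a small base R check. `classify_dim()` is a hypothetical helper used only for illustration; it applies the two expressions from the paragraph above and does not call any DelayedArray code.

```r
# Hypothetical helper: classify a dim vector using the two conditions above
classify_dim <- function(dim) {
  imax <- .Machine$integer.max
  if (any(dim > imax)) {
    "unsupported: at least one dimension exceeds integer.max"
  } else if (prod(dim) > imax) {
    "supported: long array, but every dimension fits"
  } else {
    "supported: total length fits in an integer"
  }
}

classify_dim(c(1000, 1000, 1000))   # 1e9 elements in total
classify_dim(c(27998, 1306127))     # dimensions quoted for TENxBrainData
classify_dim(c(2000, 2000, 1000))   # 4e9 elements spread over 3 dimensions
classify_dim(4e9)                   # the same 4e9 elements as one dimension
```

The last call corresponds to the reshape attempted in the question, where all 4e9 elements are requested as a single dimension.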

ADD REPLY • link 4.7 years ago Hervé Pagès 16k

score 2 · Accepted Answer · 2021-06-24

Sorry for the late answer.

The DelayedArray framework does not support datasets with dimensions greater than 2^31-1 (.Machine$integer.max), and there is no plan to change this, at least not in the near future. The problem is that supporting such datasets requires a lot more work than simply replacing the dim vector with as.integer64(dim). It requires going deep into many parts of the code, all the way down to the C level, not only in DelayedArray but also in HDF5Array and their dependencies (rhdf5, S4Vectors, IRanges, etc.), and making a lot of changes. It's a much bigger endeavor than most people might think.

In DelayedArray 0.19.1, I've improved the error message you get when calling ReshapedHDF5Array() with a dim vector that contains values > .Machine$integer.max. Admittedly, the old message was not very clear, and the extra warning added to the confusion. Now you'll get:

Error in DelayedArray:::normarg_dim(dim) : 
  'dim' cannot contain values greater than '.Machine$integer.max' (=
  2^31-1 = 2147483647)

H.