Question

Is it possible to create `HDF5Array` Delayed Arrays from HDF5 files hosted on S3?

Thomas Sandmann ▴ 90

I am looking for advice on making large collections of data (e.g. analogous to the Archs4 datasets) available to scientists within my group. Like Archs4, we currently store our data in HDF5 files. Individual users have to download the files to their system before they can access them.

I learned that the rhdf5 package, which is used by the HDF5Array package, supports reading HDF5 files directly from the cloud via the HDF5 S3 Virtual File Driver. I also found the rhdf5client package, which supports a DelayedArray backend. But I believe it can only access HDF5 files served via h5serv, a service that we currently don't have set up.

This vignette by the MultiAssay Special Interest Group mentions that:

"Additional scenarios are currently in development where an HDF5Matrix is hosted remotely."

Questions:

1. Is there already a path to create HDF5Array Delayed Arrays from HDF5 files hosted on S3 for use in Bioconductor objects (SummarizedExperiments, MultiAssayExperiments, etc.)?
2. Or are other array-based storage back-ends (e.g. TileDBArray) a better option that I should explore instead?

Any pointers are much appreciated!

Thomas

TileDBArray HDF5Array MultiAssayExperiment

score 4 · Answer 1 · 2021-02-26

Hi Thomas, Aaron, Mike, Marcel,

Starting with HDF5Array 1.19.6, HDF5Array objects can wrap files hosted on Amazon S3:

library(HDF5Array)
public_S3_url <- "https://rhdf5-public.s3.eu-central-1.amazonaws.com/rhdf5ex_t_float_3d.h5"
h5file <- H5File(public_S3_url, s3=TRUE)
h5ls(h5file)
#   group name       otype dclass        dim
# 0     /   a1 H5I_DATASET  FLOAT 5 x 10 x 2
a1 <- HDF5Array(h5file, "a1")
a1[ , 1:4, 2]
# <5 x 4> matrix of class DelayedMatrix and type "double":
#           [,1]      [,2]      [,3]      [,4]
# [1,] 0.6848327 0.1282442 0.7906603 0.9509606
# [2,] 0.6579706 0.4897645 0.3274960 0.4266883
# [3,] 0.7180033 0.4929566 0.2159982 0.8346751
# [4,] 0.8445332 0.8016226 0.8083879 0.6960833
# [5,] 0.4086140 0.3476558 0.7067179 0.6536515

You can't use parallel evaluation (e.g. blockApply(..., BPPARAM=MulticoreParam(3))) on these objects at the moment. See ?HDF5Array for more information.
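
Since parallel evaluation is not available for these objects, block processing can be forced to run sequentially instead. The following is a minimal sketch, not a tested recipe for S3: `sequential_block_sum()` is a hypothetical helper name, and the in-memory array `x` is only a stand-in with the same dimensions as `a1` so that the example runs offline. Passing the S3-backed `a1` in the same way is an assumption.

```r
library(DelayedArray)   # provides DelayedArray() and blockApply()
library(BiocParallel)   # provides SerialParam()

# Hypothetical helper: compute the sum of each block one after the other
# (no parallel workers), then combine the partial sums.
sequential_block_sum <- function(x) {
    partial <- blockApply(x, sum, BPPARAM = SerialParam())
    sum(unlist(partial))
}

# In-memory stand-in with the same dimensions as a1 (5 x 10 x 2).
x <- DelayedArray(array(runif(100), dim = c(5, 10, 2)))
total <- sequential_block_sum(x)
```

`SerialParam()` makes the sequential intent explicit; any function that reduces a block to a small result could be used in place of `sum`.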

Thanks Mike for adding support for S3 in rhdf5. It was relatively easy to follow your lead and piggyback on your work ;-)

Cheers,

H.

score 2 · Answer 2 · 2021-02-21

I can answer the second question. In the back half of last year, I did just that with base tiledb on some datasets hosted on our company's internal S3. Digging through the code, apparently this worked:

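library(tiledb)
# Stop the AWS SDK from querying the EC2 instance metadata service
Sys.setenv(AWS_EC2_METADATA_DISABLED=TRUE)
# Set the S3 region in the TileDB configuration
config <- tiledb_config()
config["vfs.s3.region"] <- "us-west-2"
# Create a context from this configuration
ctx <- tiledb_ctx(config)
# Open the array by its S3 URI and time the call
system.time(obj <- tiledb_array("s3://tiledb-test/macosko_tdb"))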

In theory, this means that it should be similarly possible to just replace the file path with the S3 URI in TileDBArray(), though I haven't tried and I don't have a public dataset on S3 to test it on.

In practice, latency was an issue. The call above needed about 4 seconds for the various handshakes, and it took another 5 seconds to actually transfer a small chunk of data. Maybe this is due to some S3 configuration, maybe there are a few optimizations to be done in the R interface, maybe it's just unavoidable when moving data over the wire. I didn't investigate any further.

score 0 · Answer 3 · 2021-02-22

My guess from the HDF5 side is that this currently isn't possible, but it wouldn't take much work to get it working. From https://support.bioconductor.org/p/9134972/ you've probably seen that rhdf5::h5read() works with S3 storage. Unfortunately in this case, HDF5Array calls its own reading function internally (h5mread()) that makes some assumptions about the dataset it's accessing being an array. This allows it to be faster than the generic "read-everything" version in rhdf5, but prevents this use case from just working as-is.
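
For reference, reading directly with rhdf5 looks roughly like the sketch below. It assumes an rhdf5 version that has the `s3` argument, network access, and the public example file whose URL appears in Answer 1; the variable name `a1_subset` is made up for illustration.

```r
library(rhdf5)

# Public example file (same URL as in Answer 1).
public_S3_url <- "https://rhdf5-public.s3.eu-central-1.amazonaws.com/rhdf5ex_t_float_3d.h5"

# List the contents of the remote file without downloading it first.
h5ls(public_S3_url, s3 = TRUE)

# Read only part of dataset "a1": all rows, columns 1 to 4, second slice.
a1_subset <- h5read(public_S3_url, name = "a1",
                    index = list(NULL, 1:4, 2), s3 = TRUE)
dim(a1_subset)
```

The `index` argument restricts the read to the requested elements, which matters when every transfer goes over the network.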

The S3 support in rhdf5 is very new (Windows support was only added to the devel version last week), and I don't think anyone has sat down and worked out exactly what we need to expose from rhdf5 to allow S3 in HDF5Array. However, I can't see a reason in principle why we can't get them working together, and I don't think it should be too much effort since we know the infrastructure for accessing HDF5 files on S3 works.