Is it possible to create `HDF5Array` Delayed Arrays from HDF5 files hosted on S3?
@thomas-sandmann-6817
Last seen 16 months ago
USA

I am looking for advice on making large collections of data (e.g. analogous to the Archs4 datasets) available to scientists within my group. Like Archs4, we currently store our data in HDF5 files. Individual users have to download the files to their system before they can access them.

I learned that the rhdf5 package, which is used by the HDF5Array package, supports reading HDF5 files directly from the cloud via the HDF5 S3 Virtual File Driver. I also found the rhdf5client package, which supports a DelayedArray backend. But I believe it can only access HDF5 files served via h5serv, a service that we currently don't have set up.
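
For concreteness, here is roughly what I have in mind for the rhdf5 route (a sketch only; the bucket and file names are hypothetical placeholders for our own data, and it assumes a recent rhdf5 with the S3 virtual file driver available):

library(rhdf5)

## List the contents of an HDF5 file directly on S3 via the ros3 driver
url <- "https://my-bucket.s3.us-west-2.amazonaws.com/counts.h5"  # hypothetical
h5ls(url, s3 = TRUE)

## Read one dataset into memory; private buckets need credentials
## passed via the s3credentials argument (see ?h5read)
counts <- h5read(url, "/counts", s3 = TRUE)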

This vignette by the MultiAssay Special Interest Group mentions that

Additional scenarios are currently in development where an HDF5Matrix is hosted remotely.

Questions:

  • Is there already a path to create HDF5Array Delayed Arrays from HDF5 files hosted on S3 for use in Bioconductor objects (SummarizedExperiments, MultiAssayExperiments, etc)?
  • Or are other array-based storage backends (e.g. TileDBArray) a better option that I should explore instead?

Any pointers are much appreciated!

Thomas

TileDBArray HDF5Array MultiAssayExperiment
@herve-pages-1542
Last seen 4 days ago
Seattle, WA, United States

Hi Thomas, Aaron, Mike, Marcel,

Starting with HDF5Array 1.19.6, HDF5Array objects can wrap files hosted on Amazon S3:

library(HDF5Array)
public_S3_url <- "https://rhdf5-public.s3.eu-central-1.amazonaws.com/rhdf5ex_t_float_3d.h5"
h5file <- H5File(public_S3_url, s3=TRUE)
h5ls(h5file)
#   group name       otype dclass        dim
# 0     /   a1 H5I_DATASET  FLOAT 5 x 10 x 2
a1 <- HDF5Array(h5file, "a1")
a1[ , 1:4, 2]
# <5 x 4> matrix of class DelayedMatrix and type "double":
#           [,1]      [,2]      [,3]      [,4]
# [1,] 0.6848327 0.1282442 0.7906603 0.9509606
# [2,] 0.6579706 0.4897645 0.3274960 0.4266883
# [3,] 0.7180033 0.4929566 0.2159982 0.8346751
# [4,] 0.8445332 0.8016226 0.8083879 0.6960833
# [5,] 0.4086140 0.3476558 0.7067179 0.6536515

You can't use parallel evaluation (e.g. blockApply(..., BPPARAM=MulticoreParam(3))) on these objects at the moment. See ?HDF5Array for more information.
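
To connect this back to your first question: a DelayedMatrix like this should drop into a SummarizedExperiment like any other assay. A minimal sketch, reusing the a1 object from above (untested beyond this toy file):

library(SummarizedExperiment)

## Take a 2D slice of the 3D dataset so it can serve as an assay
m <- a1[ , , 1]      # 5 x 10 DelayedMatrix, still backed by the file on S3
se <- SummarizedExperiment(assays = list(counts = m))
assay(se)[1:2, 1:3]  # blocks are pulled from S3 on demand

## Block processing also works, as long as evaluation stays serial
blockApply(m, sum)   # default (serial) backend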

Thanks Mike for adding S3 support in rhdf5. It was relatively easy to follow your lead and piggyback on your work ;-)

Cheers,

H.


That's awesome. Thanks so much for implementing this so quickly, Hervé 🚀! And to Mike for laying the groundwork 👍 Much appreciated.

Aaron Lun ★ 28k
@alun
Last seen 8 hours ago
The city by the bay

I can answer the second question. In the back half of last year, I did just that with base tiledb on some datasets hosted on our company's internal S3. Digging through my old code, it seems this worked:

library(tiledb)

## Skip the EC2 instance metadata lookup, which otherwise stalls outside EC2
Sys.setenv(AWS_EC2_METADATA_DISABLED=TRUE)

## Point the TileDB virtual filesystem at the bucket's region
config <- tiledb_config()
config["vfs.s3.region"] <- "us-west-2"
ctx <- tiledb_ctx(config)

system.time(obj <- tiledb_array("s3://tiledb-test/macosko_tdb"))

In theory, this means that it should be similarly possible to just replace the file path with the S3 URI in TileDBArray(), though I haven't tried and I don't have a public dataset on S3 to test it on.

In practice, latency was an issue. The tiledb_array() call above took about 4 seconds to do the various handshakes; it took another 5 seconds to actually transfer a small chunk of data. Maybe this is due to some S3 configuration, maybe there are a few optimizations to be done in the R interface, maybe it's just unavoidable when moving data over the wire - I didn't investigate any further.
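
For completeness, here is roughly what the TileDBArray() version of the above might look like (a sketch only - as I said, I haven't actually run this, so treat the details as assumptions):

library(tiledb)
library(TileDBArray)

## Same context set-up as before, so the S3 VFS knows the region
Sys.setenv(AWS_EC2_METADATA_DISABLED=TRUE)
config <- tiledb_config()
config["vfs.s3.region"] <- "us-west-2"
ctx <- tiledb_ctx(config)

## Swap the local path for the S3 URI; the result is a DelayedArray,
## so the usual block-processing machinery applies
mat <- TileDBArray("s3://tiledb-test/macosko_tdb")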


Thanks a lot for sharing your experience and your code example, Aaron. That's very helpful!

Mike Smith ★ 6.6k
@mike-smith
Last seen 6 hours ago
EMBL Heidelberg

My guess from the HDF5 side is that this currently isn't possible, but it wouldn't take much work to get it working. From https://support.bioconductor.org/p/9134972/ you've probably seen that rhdf5::h5read() works with S3 storage. Unfortunately, HDF5Array calls its own reading function internally (h5mread()), which makes some assumptions about the dataset it's accessing being an array. This allows it to be faster than the generic "read-everything" version in rhdf5, but prevents this use case from just working as-is.
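
To make the contrast concrete, using the same public test file that appears elsewhere in this thread (a sketch; the exact failure mode may differ):

library(rhdf5)
url <- "https://rhdf5-public.s3.eu-central-1.amazonaws.com/rhdf5ex_t_float_3d.h5"

## Works: the generic rhdf5 reader goes through the S3 virtual file driver
a <- h5read(url, "a1", s3 = TRUE)

## Doesn't work (currently): h5mread() has no S3 code path
## HDF5Array::HDF5Array(url, "a1")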

The S3 support in rhdf5 is very new (Windows support was only added to the devel version last week), and I don't think anyone has sat down and worked out exactly what we need to expose from rhdf5 to allow S3 in HDF5Array. However, I can't see any reason in principle why we can't get them working together, and I don't think it should be too much effort, since we know the infrastructure for accessing HDF5 files on S3 works.


It is possible when hosting with the HDF Scalable Data Service (HSDS), together with restfulSE and rhdf5client, as shown here: https://vjcitn.github.io/intmmo/articles/htxlook.html. It does require some extra infrastructure, though.

It would be good to have an S3 option from HDF5Array.
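
For reference, the rhdf5client route looks something like this (a sketch adapted from the rhdf5client vignette, using its public HSDS endpoint and demo file):

library(rhdf5client)

## Wrap a dataset served by a public HSDS instance as a DelayedArray
da <- HSDSArray(URL_hsds(), "hsds",
                "/shared/bioconductor/darmgcls.h5", "/assay001")
da[1:5, 1:3]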

