I am looking for advice on making large collections of data (e.g. analogous to the ARCHS4 datasets) available to scientists within my group. Like ARCHS4, we currently store our data in HDF5 files, and individual users have to download the files to their systems before they can access them.
I learned that the rhdf5 package, which is used by the HDF5Array package, supports reading HDF5 files directly from the cloud via the HDF5 S3 Virtual File Driver. I also found the rhdf5client package, which provides a DelayedArray backend, but I believe it can only access HDF5 files served via h5serv, a service that we currently don't have set up.
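For context, here is roughly what I have been experimenting with so far. This is a sketch using the `s3 = TRUE` argument that recent rhdf5 versions expose on `h5ls()` and `h5read()`; the bucket URL and dataset path below are placeholders, not real data:

```r
library(rhdf5)

## Placeholder URL for a publicly readable object in an S3 bucket
url <- "https://my-bucket.s3.amazonaws.com/expression.h5"

## List the datasets in the remote file without downloading it
h5ls(url, s3 = TRUE)

## Read only a slice of a (hypothetical) "data/expression" dataset
## over HTTP, using the S3 Virtual File Driver under the hood
mat <- h5read(url, "data/expression",
              index = list(1:10, 1:5),
              s3 = TRUE)
```

This works for ad-hoc slices, but it returns plain in-memory arrays rather than DelayedArray-backed objects, which is what leads to my questions below.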
This vignette by the MultiAssay Special Interest Group mentions that "Additional scenarios are currently in development where an HDF5Matrix is hosted remotely."
- Is there already a path to create HDF5Array/DelayedArray objects from HDF5 files hosted on S3, for use in Bioconductor objects (SummarizedExperiments, MultiAssayExperiments, etc.)?
- Or are other array-based storage backends (e.g. TileDBArray) a better option that I should explore instead?
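To make the first question concrete, this is the kind of workflow I am hoping is (or will be) supported. I believe recent HDF5Array versions provide an `H5File()` constructor with an `s3 = TRUE` option, but I am not certain of the exact API, and the URL and dataset name here are placeholders:

```r
library(HDF5Array)
library(SummarizedExperiment)

## Placeholder URL; assumes the object is publicly readable
fl <- H5File("https://my-bucket.s3.amazonaws.com/expression.h5",
             s3 = TRUE)

## A DelayedArray backed by the remote file; no data is downloaded
## until subsets are realized
counts <- HDF5Array(fl, name = "data/expression")

## Wrap the remote-backed assay in a SummarizedExperiment
se <- SummarizedExperiment(assays = list(counts = counts))
```

If something along these lines already works, pointers to the relevant documentation would be very welcome.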
Any pointers are much appreciated!