I am looking for advice on making large collections of data (e.g. analogous to the Archs4 datasets) available to scientists within my group. Like Archs4, we currently store our data in HDF5 files. Individual users have to download the files to their system before they can access them.
I learned that the rhdf5 package, which is used by the HDF5Array package, supports reading HDF5 files directly from the cloud via the HDF5 S3 Virtual File Driver. I also found the rhdf5client package, which provides a DelayedArray backend, but I believe it can only access HDF5 files served via h5serv, a service that we currently don't have set up.
This vignette by the MultiAssay Special Interest Group mentions that "additional scenarios are currently in development where an HDF5Matrix is hosted remotely."
Questions:
- Is there already a path to create HDF5Array DelayedArrays from HDF5 files hosted on S3, for use in Bioconductor objects (SummarizedExperiments, MultiAssayExperiments, etc.)?
- Or are other array-based storage back-ends (e.g. TileDBArray) a better option that I should explore instead?
Any pointers are much appreciated!
Thomas
That's awesome. Thanks so much for implementing this so quickly, Hervé 🚀! And to Mike for laying the groundwork 👍. Much appreciated.