I'd like to get some statistics on performance and maintenance issues of using the delayed HDF5 back end for SummarizedExperiments. This is part of a general project started in the benchOOM package. In that package I tentatively defined a generic "writeHDF5Dataset" and a method for RangedSummarizedExperiment:
> bano2 = benchOOM::writeHDF5Dataset(banovichSE, "./banoDSE.h5", "banoDSE")
> bano2
class: DelayedRangedSummarizedExperiment
dim: 329469 64
metadata(0):
assays(1): betas
rownames(329469): cg00000029 cg00000165 ... ch.9.98989607R ch.9.991104F
rowData names(10): addressA addressB ... probeEnd probeTarget
colnames(64): NA18498 NA18499 ... NA18489 NA18909
colData names(35): title geo_accession ... data_row_count naid
It seems to work well. Is this the right place to have a 'write...Dataset' method -- to have it operate on the SummarizedExperiment? If so, perhaps we can have the generic defined in HDF5Array; right now it is just a function.
Once this is determined, I would like to have a simple packaging protocol, so that a 'large' SummarizedExperiment can be packaged with its HDF5 representation, and the S4 wrapper is created on attachment, with pointers to the installed package folder with the HDF5. Or, the HDF5 can be shipped via ExperimentHub -- but are we ready for this, and do we have an example of a data package that relies on this approach?
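To make the packaging idea concrete, here is a rough sketch of what such a wrapper could look like. It assumes the HDF5Array() constructor from the HDF5Array package; the helper name wrapHDF5asSE and the data package layout in the trailing comment are hypothetical, not an existing API.

```r
library(SummarizedExperiment)
library(HDF5Array)

# Hypothetical helper: wrap one dataset of an HDF5 file in a
# SummarizedExperiment without reading the values into memory.
wrapHDF5asSE <- function(h5file, name) {
  stopifnot(file.exists(h5file))
  assays <- list(HDF5Array(h5file, name))  # on-disk, delayed assay
  names(assays) <- name
  SummarizedExperiment(assays = assays)
}

# In a data package, the file could be located in the installed folder with
# something like (package and file names are assumptions):
#   h5 <- system.file("extdata", "banoDSE.h5", package = "banoData")
#   banoDSE <- wrapHDF5asSE(h5, "banoDSE")
```

The rowData and colData would still have to be shipped and attached separately in this sketch; it only covers the assay pointer.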
Finally it seems we are ready for a "distributed" SummarizedExperiment with HDF5Server responding to queries. Hope to get some more information on that this week.
After giving it a second thought, I decided to replace writeHDF5Dataset() with writeHDF5Array()
and to deprecate the former. Returning an HDF5Array object instead of an HDF5ArraySeed makes more sense and covers more use cases. HDF5ArraySeed is a low-level container that the end user should not need to manipulate directly; it is only used behind the scenes to support HDF5Array objects. This change is in HDF5Array 1.3.5.
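As a small illustration of the intended usage, the sketch below writes an ordinary matrix to a temporary HDF5 file and gets back an HDF5-backed object. The toy matrix, the temporary file and the dataset name "betas" are made up for the example; the arguments are passed positionally (object, file path, dataset name).

```r
library(HDF5Array)

# Toy in-memory matrix standing in for a real assay
m <- matrix(runif(20), nrow = 5)

# Write it to a fresh HDF5 file; the return value points at the on-disk data
h5file <- tempfile(fileext = ".h5")
A <- writeHDF5Array(m, h5file, "betas")

class(A)  # an HDF5Array derivative
dim(A)

# Values are only read from disk when realized
all.equal(unname(as.matrix(A)), m)
```

The returned object can be used directly as an assay of a SummarizedExperiment, which is the main reason for returning it rather than the seed.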
The packaging of an HDF5-based SummarizedExperiment object is an important question that we discussed a little bit last year with Pete Hickey and Kasper. It seems tricky to tackle this in the more general case of a SummarizedExperiment object that contains a mix of in-memory and on-disk assays and needs to be saved to an arbitrary backend. But maybe supporting the HDF5 case is not that hard. I'll try to come up with a proposal for this.
H.
Hi Vince,
So I added saveHDF5SummarizedExperiment/loadHDF5SummarizedExperiment to SummarizedExperiment 1.5.6 for saving/loading an HDF5-based SummarizedExperiment object to/from disk. It's an attempt at supporting HDF5 packaging of an arbitrary SummarizedExperiment (or derived) object. See ?saveHDF5SummarizedExperiment for more information.
H.
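For anyone following along, a minimal round trip might look like the sketch below. The toy object and the temporary directory are invented for the example. Both packages are loaded because, as far as I know, later releases moved these two functions from SummarizedExperiment to HDF5Array.

```r
library(SummarizedExperiment)
library(HDF5Array)

# Toy SummarizedExperiment with a single in-memory assay
betas <- matrix(runif(12), nrow = 4)
se <- SummarizedExperiment(assays = list(betas = betas))

# Save to a directory that does not exist yet; the assay data goes to HDF5
# and the rest of the object is serialized alongside it
dir <- tempfile("h5_se_")
saveHDF5SummarizedExperiment(se, dir = dir)

# Load it back; the assay is now HDF5-backed rather than in memory
se2 <- loadHDF5SummarizedExperiment(dir)
se2
```

The resulting directory is self-contained, so in principle it could be placed in a data package or shipped some other way and loaded from wherever it ends up.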
Excellent. Thanks very much.