I'd like to gather some statistics on the performance and maintenance implications of using the delayed HDF5 back end for SummarizedExperiments. This is part of a broader project started in the benchOOM package, where I tentatively defined a generic "writeHDF5Dataset" and a method for RangedSummarizedExperiment:
> bano2 = benchOOM::writeHDF5Dataset(banovichSE, "./banoDSE.h5", "banoDSE")
dim: 329469 64
rownames(329469): cg00000029 cg00000165 ... ch.9.98989607R ch.9.991104F
rowData names(10): addressA addressB ... probeEnd probeTarget
colnames(64): NA18498 NA18499 ... NA18489 NA18909
colData names(35): title geo_accession ... data_row_count naid
It seems to work well. Is this the right place for a 'write...Dataset' method -- that is, should it operate on the SummarizedExperiment itself? If so, perhaps the generic could be defined in HDF5Array; right now it is just a function.
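For concreteness, here is one way such a generic and method might look. This is only a sketch, not benchOOM's actual implementation: it assumes the HDF5Array and SummarizedExperiment packages, and leans on HDF5Array::writeHDF5Array() to realize each assay on disk as an HDF5-backed DelayedArray.

```r
## Sketch only -- benchOOM's real definition may differ.
library(SummarizedExperiment)
library(HDF5Array)

setGeneric("writeHDF5Dataset",
    function(x, filepath, name, ...) standardGeneric("writeHDF5Dataset"))

## Write each assay of the SE to `filepath` and return an SE whose assays
## are delayed, HDF5-backed arrays; row/column metadata stay in memory.
setMethod("writeHDF5Dataset", "RangedSummarizedExperiment",
    function(x, filepath, name, ...) {
        for (i in seq_along(assays(x))) {
            aname <- sprintf("%s_assay%d", name, i)  # illustrative naming scheme
            assays(x)[[i]] <- writeHDF5Array(assay(x, i), filepath, aname)
        }
        x
    })
```

Keeping the method's signature identical to writeHDF5Array's (x, filepath, name) would make it natural to host the generic in HDF5Array later.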
Once this is settled, I would like a simple packaging protocol, so that a 'large' SummarizedExperiment can be distributed alongside its HDF5 representation: the S4 wrapper would be created on attachment, with pointers into the installed package folder containing the HDF5 file. Alternatively, the HDF5 could be shipped via ExperimentHub -- but are we ready for that, and do we have an example of a data package that relies on this approach?
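The installed-package variant of the protocol could be as simple as a constructor that stitches the wrapper back together at attach time. The sketch below is hypothetical: the package name "banoData", the file names, and the idea of serializing the row/column metadata to an .rds next to the HDF5 file are all assumptions, not an existing convention; only system.file(), HDF5Array(), and the SummarizedExperiment() constructor are real APIs.

```r
library(SummarizedExperiment)
library(HDF5Array)

## Hypothetical loader for a data package that ships inst/extdata/banoDSE.h5
## plus an .rds holding the rowRanges and colData. Called from .onAttach()
## or exported directly, it rebuilds the SE with a delayed HDF5 assay.
makeBanoSE <- function(pkg = "banoData") {
    h5   <- system.file("extdata", "banoDSE.h5", package = pkg)
    meta <- readRDS(system.file("extdata", "banoDSE_meta.rds", package = pkg))
    SummarizedExperiment(
        assays    = list(meth = HDF5Array(h5, "banoDSE")),  # no data read yet
        rowRanges = meta$rowRanges,
        colData   = meta$colData)
}
```

The assay data are only touched when a user subsets or realizes the DelayedArray, so attaching the package stays cheap regardless of the HDF5 file's size.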
Finally, it seems we are ready for a "distributed" SummarizedExperiment, with HDF5Server responding to queries. I hope to get more information on that this week.