Tutorial: "developing" with HDF5Array
2.5 years ago, Vincent J. Carey, Jr. (United States) wrote:

I'd like to get some statistics on performance and maintenance issues of using the delayed HDF5 back end for SummarizedExperiments. This is part of a general project started in the benchOOM package. In that package I tentatively defined a generic "writeHDF5Dataset" and a method for RangedSummarizedExperiment:

> bano2 = benchOOM::writeHDF5Dataset(banovichSE, "./banoDSE.h5", "banoDSE")
> bano2
class: DelayedRangedSummarizedExperiment
dim: 329469 64
metadata(0):
assays(1): betas
rownames(329469): cg00000029 cg00000165 ... ch.9.98989607R ch.9.991104F
rowData names(10): addressA addressB ... probeEnd probeTarget
colnames(64): NA18498 NA18499 ... NA18489 NA18909
colData names(35): title geo_accession ... data_row_count naid

It seems to work well. Is this the right place to have a 'write...Dataset' method -- to have it operate on the SummarizedExperiment? If so, perhaps the generic could be defined in HDF5Array; right now it is just a function there.

Once this is determined, I would like to have a simple packaging protocol, so that a 'large' SummarizedExperiment can be packaged with its HDF5 representation, and the S4 wrapper is created on attachment, with pointers to the installed package folder with the HDF5.  Or, the HDF5 can be shipped via ExperimentHub -- but are we ready for this, and do we have an example of a data package that relies on this approach?

Finally it seems we are ready for a "distributed" SummarizedExperiment with HDF5Server responding to queries.  Hope to get some more information on that this week.

2.5 years ago, Hervé Pagès (United States) wrote:

Hi Vince,

assay(se) <- HDF5Array(writeHDF5Dataset(assay(se), "./banoDSE.h5", "banoDSE"))

Not super convenient though. So I'm going to add writeHDF5Array() to the HDF5Array package. It's going to do HDF5Array(writeHDF5Dataset(...)), that is, the same as writeHDF5Dataset() except that it will return an HDF5Array object instead of an HDF5Dataset object. Then we'll be able to do:

assay(se) <- writeHDF5Array(assay(se), "./banoDSE.h5", "banoDSE")

With this approach we can cherry-pick the assays that we want to store in HDF5 when se has more than one assay, e.g.:

assays(se)$counts <- writeHDF5Array(assays(se)$counts, "./banoDSE.h5", "counts")
assays(se)$mu <- writeHDF5Array(assays(se)$mu, "./banoDSE.h5", "mu")
etc...

A "writeHDF5Dataset" method for RangedSummarizedExperiment objects does not provide that level of control (and it's actually not clear what exactly it would do on a multi-assay RangedSummarizedExperiment object).

Also I hope we can avoid the cost of introducing dedicated Delayed* classes for SummarizedExperiment and all its derivatives. The SummarizedExperiment container was designed from the very start to support alternative representation/storage of the assays. The same object can actually contain assays that mix different types of storage. Is a DelayedRangedSummarizedExperiment object a RangedSummarizedExperiment with all its assays delayed, or with at least one of its assays delayed?
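The mixed-storage case described above can be sketched on a toy object. This is only an illustration under stated assumptions: the matrices, dimnames and temporary file are made up, and the call relies on the writeHDF5Array(x, filepath, name) form used earlier in this reply. Depending on the package version, the assay getter may return the HDF5-backed assay as a DelayedMatrix rather than an HDF5Matrix.

```r
library(SummarizedExperiment)
library(HDF5Array)

# Toy two-assay object; names and sizes are illustrative only.
set.seed(1)
counts <- matrix(rpois(200, lambda = 10), nrow = 20,
                 dimnames = list(paste0("gene", 1:20), paste0("s", 1:10)))
mu <- matrix(runif(200), nrow = 20, dimnames = dimnames(counts))
se <- SummarizedExperiment(assays = list(counts = counts, mu = mu))

# Move only 'counts' to disk; 'mu' stays an ordinary in-memory matrix.
h5file <- tempfile(fileext = ".h5")
assays(se)$counts <- writeHDF5Array(assays(se)$counts, h5file, "counts")

# One container, two kinds of storage.
class(assays(se)$counts)  # a DelayedArray derivative backed by the HDF5 file
class(assays(se)$mu)      # a plain matrix
```

Nothing about the class of se changes here; only the class of one assay does, which is the argument against needing a dedicated Delayed* container.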

H.

After giving it a second thought, I decided to replace writeHDF5Dataset() with writeHDF5Array() and to deprecate the former. Returning an HDF5Array object instead of an HDF5ArraySeed makes more sense and covers more use cases. HDF5ArraySeed is a low-level container that the end user should not need to manipulate directly. It's only used behind the scenes to support HDF5Array objects.

This change is in HDF5Array 1.3.5.

The packaging of an HDF5-based SummarizedExperiment object is an important question that we discussed a little bit last year with Pete Hickey and Kasper. It seems tricky to tackle this in the more general case of a SummarizedExperiment object that contains a mix of in-memory and on-disk assays and needs to be saved to an arbitrary backend. But maybe supporting the HDF5 case is not that hard. I'll try to come up with a proposal for this.

H.


Hi Vince,

So I added saveHDF5SummarizedExperiment/loadHDF5SummarizedExperiment to SummarizedExperiment 1.5.6 for saving/loading an HDF5-based SummarizedExperiment object to/from disk. It's an attempt at supporting HDF5 packaging of an arbitrary SummarizedExperiment (or derived) object.

See ?saveHDF5SummarizedExperiment for more information.
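A minimal round trip with these two functions might look like the sketch below. The toy data and the directory name are made up for illustration. Both SummarizedExperiment and HDF5Array are loaded because, as far as I know, more recent Bioconductor releases ship these functions in HDF5Array rather than SummarizedExperiment; check the man page for the version you have installed.

```r
library(SummarizedExperiment)
library(HDF5Array)

# Toy single-assay object standing in for a large data set.
set.seed(1)
betas <- matrix(runif(600), nrow = 100,
                dimnames = list(paste0("cg", 1:100), paste0("NA", 1:6)))
se <- SummarizedExperiment(assays = list(betas = betas))

# Save to a directory: the assay data goes to an HDF5 file, the rest of
# the object is serialized alongside it.
outdir <- file.path(tempdir(), "toy_h5_se")
saveHDF5SummarizedExperiment(se, dir = outdir, replace = TRUE)

# Later, possibly in another session, rebuild the object from the directory.
se2 <- loadHDF5SummarizedExperiment(dir = outdir)
se2
```

For packaging, the same directory could in principle be shipped inside a data package and located with system.file() at load time, but that part is not shown here.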

H.

Excellent.  Thanks very much.

2.5 years ago, Vincent J. Carey, Jr. (United States) wrote:

Thanks Hervé. I am not considering mixed assay representations, or even multiple assays. At the moment I am trying to get some realistic performance appraisals of the simple hybrid situation: a single large assay in HDF5, with everything else in the standard RAM-based representation. I agree with you that the Delayed...SummarizedExperiment classes should not be necessary. But we probably need some kind of marker class to indicate the back end? At the moment I would think that assuming all assays have the same back end is not too restrictive. To be completely clear, this is in the context of a single SummarizedExperiment containing multiple mutually conformant assays, not a MultiAssayExperiment (as defined in the eponymous package, which caters for discrepancies between samples, features, and representations of multiple assays).
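A rough way to start such an appraisal is to time the same summary on an in-memory matrix and on its HDF5-backed copy. The sketch below is not taken from benchOOM; the matrix size, the temporary file and the choice of colSums() are arbitrary assumptions, and the timings depend heavily on chunking, compression and disk.

```r
library(HDF5Array)

# Arbitrary toy matrix; a real appraisal would use a realistic assay.
set.seed(1)
m <- matrix(runif(1e6), nrow = 1e4)

# HDF5-backed copy of the same data.
h5 <- writeHDF5Array(m, tempfile(fileext = ".h5"), "m")

# Same operation on both representations.
t_mem <- system.time(cs_mem <- colSums(m))
t_h5  <- system.time(cs_h5  <- colSums(h5))

rbind(in_memory = t_mem, hdf5 = t_h5)
```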

To continue the experiments in benchOOM I will introduce an HDFSummarizedExperiment class there and a coercion to transform a standard SummarizedExperiment to that form, recognizing that the more general solution will emerge elsewhere.

benchOOM has been updated to include rhconvert, which converts the assay portion of a RangedSummarizedExperiment to HDF5Array (delayed) and returns a SummarizedExperiment with this alternate representation of the assay data.  (rhconvert = "RangedSummarizedExperiment to HDF5 conversion")


Hi Vince,

Check the "realize" method for SummarizedExperiment objects that I added yesterday to SummarizedExperiment 1.5.5. It might be doing what rhconvert() does.
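As a hedged sketch of what that could look like: the example assumes the "realize" method for SummarizedExperiment objects accepts a BACKEND argument, as the realize() generic in DelayedArray does, and that "HDF5Array" is a valid backend name. The toy data is made up, and the HDF5 file is written wherever the HDF5Array dump settings point.

```r
library(SummarizedExperiment)
library(HDF5Array)

# Toy in-memory object.
set.seed(1)
betas <- matrix(runif(600), nrow = 100)
se <- SummarizedExperiment(assays = list(betas = betas))

# Assumption: realize() on a SummarizedExperiment realizes each assay
# with the requested backend and returns the updated object.
se_h5 <- realize(se, BACKEND = "HDF5Array")

assay(se_h5)  # now a DelayedArray derivative backed by an HDF5 file
```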

H.