Building an HDF5Array/DelayedArray object incrementally
1
3
Entering edit mode
@stephen-piccolo-6761
Last seen 4.2 years ago
United States

Thanks to those who have provided the excellent DelayedArray (and related) packages. I've found some helpful examples online for converting a matrix to objects with back ends such as HDF5Array and then querying such datasets. But so far I haven't found much about building an HDF5Array object when the source data is too large to store in a matrix. Can I build that row by row, for example? Do you happen to have any examples for doing this?

Sorry if I missed something. Many thanks,

-Steve

delayedarray • 1.3k views
ADD COMMENT
5
Entering edit mode
@james-w-macdonald-5106
Last seen 3 days ago
United States

There may be much smarter ways to do this (I'm taking the over!), but what I have done in the past is to create a HDF5 file, that has pre-specified dimensions, and then sticking matrices of the data in, one at a time. So something like

library(HDF5Array)
fn <- <file name goes here>
h5createFile(fn)
h5createDataset(fn, "somename", c(NCOL, NROW), <other args go here>)
for(i in 1:num_matrices){
   mat <- <make a matrix of your data, or read it in or whatever>
   h5write(mat, fn, "somename", FALSE, index = list(<rows the data are going into>, <columns the data are going into>))
}
H5close()

You might want to specify what sort of data you will be adding, using the storage.mode argument for h5createDataset, and the chunk sizes you want to be able to read back out. Then the only trick is to get the row and column counting right.

I used this general idea for a collaborator who wanted a SummarizedExperiment, where for any common biallelic SNP (like 8 million or so? I forget), and any tissue from the GTEx consortium, there would be a boolean saying if the SNP overlapped a given eQTL. Since I couldn't create a matrix that large in R, I just generated in chunks and put into the HDF5 file, then wrapped in the SummarizedExperiment.

ADD COMMENT
1
Entering edit mode

In hcreateDataset, the third argument is c(NROW, NCOL) for the total number of rows and columns you will need. Not the opposite as I showed in my pseudo-code

ADD REPLY
1
Entering edit mode

As an actual example

> library(HDF5Srray)
> fn <- "assay.h5"
> h5createFile(fn)
[1] TRUE
> h5createDataset(fn, "assay1", c(1e6, 50), storage.mode = "double", chunk = c(1000, 50))
[1] TRUE
> for(i in 1:100){
   mat <- matrix(rnorm(50000), ncol = 50)
   h5write(mat, fn, "assay1", FALSE, index = list(c((i*1000-999):(i*1000)), 1:50))
}
> H5close()

> h5read(fn, "assay1", index = list(1:10, 1:10))
             [,1]        [,2]       [,3]        [,4]        [,5]        [,6]
 [1,]  0.43608902  0.14868494  0.2076124  0.30612972  1.48927991  1.28171956
 [2,] -2.50603838  1.70552861  0.1827294 -2.58016124 -1.31100712 -3.01365010
 [3,] -0.08183648 -1.13521540 -1.0339713 -0.51565362  1.11343300  0.30785107
 [4,]  0.42014308  1.19301767 -1.0821754 -1.11207395  2.42150819  1.84740143
 [5,] -0.32782979 -0.05638034  1.4460466 -1.12444421  0.47713206  1.75111987
 [6,]  0.85734272  0.98783108  0.3756628 -0.05549879 -0.19730422 -0.07823592
 [7,] -0.48553100 -1.17610537  0.8481388  0.02217590 -0.07551573  0.06593631
 [8,] -0.25455525  0.96053929 -0.4527501 -0.02715035 -0.54007328 -1.10289965
 [9,]  1.27365793 -0.66572171 -0.2347437 -1.65024175 -1.49246802  0.46727486
[10,]  0.68927445 -1.17478332  1.2394413 -0.10118879 -0.27118213  0.65316317
             [,7]       [,8]       [,9]      [,10]
 [1,] -1.33998490  0.3087755 -1.6386299  0.1843357
 [2,] -0.17659688  1.3342337  0.9260548 -0.1244211
 [3,]  0.84151392 -0.4823801  0.8476953 -1.4324426
 [4,]  0.42939169  0.5100834  0.1516622  0.2288680
 [5,] -0.13985150  0.3642955  0.1742245  1.7998314
 [6,]  0.17610892 -0.5977365 -0.2504903  0.1182326
 [7,]  0.28518733  0.6681628 -0.8443199 -0.2907716
 [8,] -1.33097447 -0.7483504 -0.1198190  0.6977359
 [9,] -0.04961757 -0.1310102  1.3314621 -0.9943571
[10,] -1.62339764 -0.5002202  0.2820152  0.1747318

ADD REPLY
1
Entering edit mode

This the way I'd do it too. I'd only add a couple of things:

  • For completeness you can then construct the actual HDF5Array via:
> HDF5Array(fn, "assay1")
<1000000 x 50> matrix of class HDF5Matrix and type "double":
                 [,1]       [,2]       [,3] ...         [,49]         [,50]
      [1,]  0.4648529 -0.8981272  0.1944075   . -0.0612682904 -0.8569700731
      [2,]  1.2365094  2.6392853 -0.3395745   . -0.5094227072 -0.7216030999
      [3,] -0.7585193  0.1598974 -1.7107732   .  0.2070367919  2.8175596120
      [4,]  0.6978633 -0.3185115 -1.6694794   .  0.0007597542  0.0996235416
      [5,] -1.2094602  0.1478289  1.3072060   . -0.3733785167  1.6689906259
       ...          .          .          .   .             .             .
 [999996,]          0          0          0   .             0             0
 [999997,]          0          0          0   .             0             0
 [999998,]          0          0          0   .             0             0
 [999999,]          0          0          0   .             0             0
[1000000,]          0          0          0   .             0             0
  • I'd avoid using H5close(). I think it can be a bit heavy handed and under some circumstances will close the underlying HDF5 library, which then breaks the R/HDF5 interface and results in weird errors that are hard to debug. In this code example you shouldn't have to use anything as h5write() will tidy up after itself, but there's also h5closeAll() if you want to be certain all files have been flushed and closed.
ADD REPLY
0
Entering edit mode

Thanks for the advice about H5close. I was under the impression that you had to tell the HDF5 library to 'finish things off', but if h5write already pretty much handles that, all the better.

ADD REPLY

Login before adding your answer.

Traffic: 523 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6