@felicienmeunier-14964

For research purposes I need to read many large HDF5 files generated by an environmental model. Right now I open them in RStudio using the rhdf5 package. I recently noticed that reading the HDF5 files is the most time-limiting step, so I wanted to speed up that part of the code. Switching from rhdf5 to the old hdf5 package, the read time was considerably reduced (it goes something like ten times faster). Is there a reasonable explanation for such a speed difference? I would prefer to keep rhdf5, as it is still supported, so is there a way to reach better performance with it? In case it is due to some bad practice on my side, please find below the two lines of code I use to open and read the HDF5 files:

With rhdf5 (slow):

mymont    = h5read(h5file,'/')

and with hdf5 (fast):

mymont    = hdf5load(file=h5file,load=FALSE,verbosity=0,tidy=TRUE)

This isn't a particularly good example because 'toy.h5' is small, but

fl = system.file(package="HDF5Array", "extdata", "toy.h5")
## f0/f1 were not shown in the original post; presumably:
f0 = function() hdf5load(file=fl, load=FALSE, verbosity=0, tidy=TRUE)
f1 = function() h5read(fl, '/')
microbenchmark(f0(), f1(), times=5)

> microbenchmark(f0(), f1(), times=5)
Unit: milliseconds
expr      min       lq      mean   median       uq       max neval cld
f0() 93.48307 94.90350 100.62300 95.84296 95.87789 123.00760     5   b
f1() 70.55737 70.90299  72.31741 72.23494 72.69283  75.19891     5  a 

So rhdf5 is faster. More important, though, is that hdf5 reads the data literally, whereas rhdf5 transposes it (this is by design).

> str(f1())
List of 2
 $ M1: num [1:10000, 1:150] 0.44 0.189 0.992 0.526 0.624 ...
 $ M2: num [1:150, 1:200] 9.033 0.56 8.109 5.852 -0.821 ...
> str(f0())
Dotted pair list of 2
 $ M2: num [1:200, 1:150] 9.033 -0.837 -3.066 14.924 -4.724 ...
 $ M1: num [1:150, 1:10000] 0.44 0.023 0.117 0.549 0.525 ...

So I wonder whether you have a benchmark file that you'd be willing to share?

> packageVersion("rhdf5")
[1] '2.23.5'
> packageVersion("hdf5")
[1] '1.6.9'


Yes, sure.

I tested the two functions over the full range of file sizes and types:

library(rhdf5)
library(hdf5)
library(microbenchmark)

direct = '/home/femeunier/Documents/ED2/ED/run/paracou/analy/'
direct2= '/home/femeunier/Documents/ED2_archives/Marcos/ED2/ED/run/amazon/nolianas/'

small_fl = file.path(direct,'paracou-Q-2000-03-00-000000-g01.h5')
normal_fl = file.path(direct,'paracou-Q-2036-01-00-000000-g01.h5')
S_fl = file.path(direct2,'amazon-S-2029-07-01-000000-g01.h5')

## the benchmark functions were not shown in the original post; presumably
## each pair wraps hdf5load() and h5read() for one of the three files:
f0 = function() hdf5load(file=small_fl, load=FALSE, verbosity=0, tidy=TRUE)
f1 = function() h5read(small_fl, '/')
f2 = function() hdf5load(file=normal_fl, load=FALSE, verbosity=0, tidy=TRUE)
f3 = function() h5read(normal_fl, '/')
f4 = function() hdf5load(file=S_fl, load=FALSE, verbosity=0, tidy=TRUE)
f5 = function() h5read(S_fl, '/')

> microbenchmark(f0(), f1(), times=5)
Unit: milliseconds
expr        min         lq       mean     median         uq        max neval
f0()   25.24533   26.10443   27.34176   26.78865   28.15434   30.41607     5
f1() 1063.33951 1108.48518 1166.39708 1176.59489 1200.20679 1283.35904     5

> microbenchmark(f2(), f3(), times=5)
Unit: milliseconds
expr        min         lq       mean     median         uq        max neval
f2()   35.69882   36.25049   46.51748   36.59257   38.60065   85.44488     5
f3() 1083.67681 1106.77546 1148.52053 1116.24633 1164.70662 1271.19742     5

> microbenchmark(f4(), f5(), times=5)
Unit: milliseconds
expr      min       lq     mean   median       uq      max neval
f4() 269.1774 279.5624 296.0794 292.2530 316.0172 323.3869     5
f5() 714.0194 726.4895 775.2907 742.2052 769.6752 924.0642     5

It appears that hdf5 is always faster, but for the second type ('S' instead of 'Q' files) the time difference is less critical. In any case, I must read everything ('S' and 'Q' files) and almost the whole HDF5 content.

https://www.dropbox.com/s/jm1evy3nur8hio7/paracou-Q-2036-01-00-000000-g01.h5?dl=0

> packageVersion("rhdf5")
[1] ‘2.22.0’
> packageVersion("hdf5")
[1] ‘1.6.10’


Thanks for the follow-up with your data. I'll take a look at it in the next few days and try to offer some insight. I just did some testing with the code below, reading a 'large' integer matrix, where rhdf5 seemed to be quicker, so there are definitely a few things to look at. This is with the devel version of rhdf5 that I mentioned.

Out of interest, did you create version 1.6.10 of hdf5 yourself? The latest I can find is 1.6.9, and I had to modify the configure script to get it to install.

library(rhdf5)
library(hdf5)
library(microbenchmark)

## large matrix
dat <- matrix(1L:50000000L, ncol = 500)

## specify file
h5File <- tempfile(fileext = ".h5")
if(file.exists(h5File))
    file.remove(h5File)

## open file
fid <- H5Fcreate(name = h5File)

## create dataset with GZIP compression
sid <- H5Screate_simple(dims = dim(dat), maxdims = dim(dat))
pid <- H5Pcreate(type = "H5P_DATASET_CREATE")
H5Pset_deflate( pid, level = 6 )
H5Pset_chunk(pid, c(100, 100))
did <- H5Dcreate(h5loc = fid, name = "blosc",
                 dtype_id = "H5T_NATIVE_INT32",
                 sid, dcpl = pid)
H5Dwrite(did, buf = dat)

## tidy up
H5Dclose(did)
H5Pclose(pid)
H5Sclose(sid)
H5Fclose(fid)

## benchmark functions (not shown in the original post); presumably:
f0 <- function() hdf5load(file = h5File, load = FALSE, verbosity = 0, tidy = TRUE)
f1 <- function() h5read(h5File, '/')


> microbenchmark(f0(), f1(), times=5)
Unit: seconds
expr      min      lq     mean   median       uq      max neval
f0() 1.645665 1.68502 1.707856 1.720116 1.728094 1.760384     5
f1() 1.012661 1.03006 1.043414 1.041873 1.046944 1.085533     5
> packageVersion("rhdf5")
[1] '2.23.6'
> packageVersion("hdf5")
[1] '1.6.9'

Thanks for the reply and the time you are dedicating to my problem. Let me know if you find anything.

No, I did not create the 1.6.10 version of the package. As it was a while ago, I can't remember exactly where, but I downloaded it online. Maybe here:

https://packages.debian.org/fr/jessie/r-cran-hdf5

Mike Smith (@mike-smith)
EMBL Heidelberg / de.NBI

It looks like your files are quite different in structure from my 'large matrix' example. You have hundreds of small datasets, and h5read() takes a relatively long time to recurse through the file and read them individually.

Here's a quick and dirty function that simplifies the reading for your file type, and performs a fair bit quicker than h5read(), although still nowhere near as fast as hdf5load(). It assumes that you have a flat hierarchy in your h5 files and that you don't want anything extra like the attribute data.

library(rhdf5)
library(hdf5)
library(microbenchmark)

h5File <- "paracou-Q-2036-01-00-000000-g01.h5"

## f0/f1 as used in the thread (the first line was truncated in the original
## post; reconstructed from the call shown in the question):
f0 <- function() hdf5load(file = h5File, load = FALSE, verbosity = 0, tidy = TRUE)
f1 <- function() h5read(h5File, '/')

h5read_optimised <- function( h5File ) {
    dset_names <- h5ls(h5File, recursive = FALSE, datasetinfo = FALSE)$name
    fid <- H5Fopen(h5File)
    contents <- sapply(dset_names, FUN = function(fid, dset_name) {
        did <- H5Dopen(fid, name = dset_name)
        res <- H5Dread(did)
        H5Dclose(did)
        res
    }, fid = fid)
    H5Fclose(fid)
    return(contents)
}

f2 <- function() h5read_optimised(h5File)

> microbenchmark(f0(), f1(), f2(), times = 5)
Unit: milliseconds
 expr        min         lq      mean    median         uq        max neval
 f0()   47.07749   47.11715   48.1333   47.2934   48.24441   50.93405     5
 f1() 1576.38092 1680.43496 1680.1064 1683.5432 1694.62365 1765.54923     5
 f2()  540.33196  544.24837  562.4359  553.9433  577.82355  595.83262     5
> identical(f1(), f2())
[1] TRUE

The remaining time difference is mostly spent by rhdf5 being very careful: every time it is passed a file or dataset handle, it checks that the type is correct. For the single large matrix example it only does this once and spends the rest of the time reading, but for your files it does this check hundreds of times and spends a significant portion of the runtime there.

Here's a version where all the checking is removed and you're right down at the interface with the C functions. Almost all of this is undocumented, and it's obviously tuned to the structure of the file you shared using the default arguments, but it's pretty quick. Based on this I am wondering whether this type checking could be made optional, so that some internal functions can skip it.

h5read_optimised_more <- function( h5File ) {
    dset_names <- h5ls(h5File, recursive = FALSE, datasetinfo = FALSE)$name

    fid <- H5Fopen(h5File)
    dapl <- H5Pcreate("H5P_DATASET_ACCESS")
    contents <- sapply(dset_names, FUN = function(fid, dset_name) {
        did <- .Call("_H5Dopen", fid@ID, dset_name, dapl@ID, PACKAGE='rhdf5')
        res <- .Call("_H5Dread", did, NULL, NULL, NULL, TRUE, 0L, FALSE, PACKAGE='rhdf5')
        invisible(.Call("_H5Dclose", did, PACKAGE='rhdf5'))
        return(res)
    }, fid = fid)

    H5Pclose(dapl)
    H5Fclose(fid)

    return(contents)
}

## f3 (not shown in the original post) presumably wraps this function:
f3 <- function() h5read_optimised_more(h5File)


> microbenchmark(f0(), f1(), f2(), f3(), times = 5)
Unit: milliseconds
expr        min         lq       mean     median         uq        max neval
f0()   48.79590   49.02661   52.35011   51.39961   52.22597   60.30245     5
f1() 1544.00682 1550.06203 1576.59195 1573.27627 1603.67828 1611.93634     5
f2()  539.84307  562.46172  576.96321  566.65601  598.58858  617.26666     5
f3()   37.99232   38.52886   39.36942   39.39166   40.29512   40.63916     5
> identical(f1(), f3())
[1] TRUE


Thanks for the follow-up. For the sake of completeness, here are the results for the different file types (f0/f3/f6 wrap hdf5load(), f1/f4/f7 wrap h5read(), and f2/f5/f8 wrap your optimised reader):

> library(rhdf5)
> library(hdf5)
> library(microbenchmark)
>
> direct = '/home/femeunier/Documents/ED2/ED/run/paracou/analy'
> direct2= '/home/femeunier/Documents/ED2_archives/Marcos/ED2/ED/run/amazon/nolianas'
>
> small_fl = file.path(direct,'paracou-Q-2000-03-00-000000-g01.h5')
> normal_fl = file.path(direct,'paracou-Q-2036-01-00-000000-g01.h5')
> S_fl = file.path(direct2,'amazon-S-2029-07-01-000000-g01.h5')
>
> f1 = function() h5read(small_fl, "/")
> microbenchmark(f0(), f1(), f2(), times=5)
Unit: milliseconds
expr        min         lq       mean     median         uq        max neval
f0()   25.40139   25.68693   26.19437   26.26390   26.74705   26.87255     5
f1() 1070.05643 1100.54817 1122.80164 1107.79444 1158.34030 1177.26885     5
f2()   19.45596   20.18687   20.21286   20.32607   20.36325   20.73216     5
>
> f4 = function() h5read(normal_fl, "/")
> microbenchmark(f3(), f4(), f5(), times=5)
Unit: milliseconds
expr        min         lq       mean     median         uq        max neval
f3()   36.99810   37.04230   37.94400   37.20590   37.24181   41.23188     5
f4() 1076.31622 1092.00438 1113.71168 1123.64652 1134.54749 1142.04376     5
f5()   22.15244   22.32525   22.57763   22.34158   22.79100   23.27790     5
>
> f7 = function() h5read(S_fl, "/")
> microbenchmark(f6(), f7(), f8(), times=5)
Unit: milliseconds
expr       min        lq      mean    median       uq      max neval
f6() 241.96867 244.93324 272.00056 251.00154 279.3269 342.7725     5
f7() 718.59014 730.99790 768.68290 745.91849 751.9041 896.0038     5
f8()  88.48649  93.96445  99.63467  96.40203 102.7033 116.6171     5

Your new optimised reading function is always much faster. I will use it for my further analysis, thank you.

Mike Smith (@mike-smith)

Interesting question, I've never really compared the various HDF5 reading packages against each other. Since they all rely on the same underlying C library, my assumption was that read speeds would be comparable between them, and that factors like compression levels, chunk size, and disk I/O would determine performance - maybe this is not the case. Given that the last update to the hdf5 package was in 2009 and it's no longer on CRAN, it definitely wasn't on my radar.

That said, I did recently find some small performance bottlenecks in rhdf5, but fixing them reduced read times by more like 25%, rather than making reads many times faster. You can see if they help your use case by trying the version in the devel branch of Bioconductor.

As far as your commands go, they are fine if you want to read the entire file. If you only need a subset of the data, then it is almost certainly quicker to use something like h5readDataset() to extract only the parts that you need. Similarly, if you're accessing the same file multiple times, it can be quicker to use the 'low-level' functions and open it once using H5Fopen(), followed by H5Dread() on the parts you need. That really depends on what exactly you're doing - it won't be beneficial for a single read of the whole file.
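A sketch of those two patterns, using h5read()'s `index` argument for the subset read and the low-level functions for repeated access (the dataset name "M1" and the index range here are hypothetical, just to illustrate the calls):

```r
library(rhdf5)

h5file <- "paracou-Q-2036-01-00-000000-g01.h5"

## read only rows 1-10 of a single dataset instead of the whole file
## (assumes the file contains a 2-D dataset called "M1")
part <- h5read(h5file, "M1", index = list(1:10, NULL))

## for repeated access, open the file once and read pieces with the
## low-level interface, closing handles when done
fid <- H5Fopen(h5file)
did <- H5Dopen(fid, "M1")
m1  <- H5Dread(did)
H5Dclose(did)
H5Fclose(fid)
```

Both approaches avoid recursing the whole file, which is where h5read(file, '/') spends much of its time on files with many datasets.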


As I said above, I need to read almost everything in these files; that is why I use hdf5load and h5read.

Thanks for the tip, I will try the devel-branch version of the rhdf5 package and let you know whether it changes anything.

EDIT: somehow I could not install the devel version:

> biocLite('rhdf5_2.23.5.tar.gz',type="source")
BioC_mirror: https://bioconductor.org
Using Bioconductor 3.6 (BiocInstaller 1.28.0), R 3.4.3 (2017-11-30).
Installing package(s) ‘rhdf5_2.23.5.tar.gz’
inferring 'repos = NULL' from 'pkgs'
ERROR: dependency ‘Rhdf5lib’ is not available for package ‘rhdf5’
* removing ‘/home/femeunier/R/x86_64-pc-linux-gnu-library/3.4/rhdf5’
Warning message:
In install.packages(pkgs = doing, lib = lib, ...) :
installation of package ‘rhdf5_2.23.5.tar.gz’ had non-zero exit status

The devel version has a changed structure: I've moved the underlying C library into a separate package called Rhdf5lib. You should be able to install that the usual way from Bioconductor, but you can also install devel versions of both from my GitHub account with:

BiocInstaller::biocLite('grimbough/Rhdf5lib')
BiocInstaller::biocLite('grimbough/rhdf5')

OK, I could install it, but it did not change much compared to the optimised solution you suggested above.