create COMPOUND dataset (2 matrices) with rhdf5?
baptiste ▴ 20

I'm trying to reproduce a specific file structure to store a complex dataset in HDF5. The example I was given reads as follows:

a <- rhdf5::H5Fopen(f) 
a
HDF5 FILE 
        name /
    filename 

               name       otype   dclass     dim
0 additional_keys   H5I_GROUP                 
2 geometry          H5I_GROUP                   
3 materials         H5I_GROUP                    
5 tmatrix           H5I_DATASET COMPOUND 30 x 30
6 uuid              H5I_DATASET OPAQUE   ( 0 )  
7 vacuum_wavelength H5I_DATASET FLOAT    ( 0 )

where the "tmatrix" dataset is the most important part. It consists of two 30x30 matrices, r and i (the real and imaginary parts, since HDF5 does not support complex values natively):

tmat <- rhdf5::h5read(f, 'tmatrix', compoundAsDataFrame = FALSE)
str(tmat)
List of 2
 $ r: num [1:30, 1:30] -6.01e-05 9.17e-08 -7.26e-09 1.74e-05 1.27e-07 ...
 $ i: num [1:30, 1:30] -4.27e-04 -1.15e-08 -2.75e-09 1.45e-05 7.81e-08 ...
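
For reference, a list of this shape can be recombined into a single complex matrix in base R. This is only a minimal sketch: the small list below is a made-up stand-in for the real data, not values from the file.

```r
# Stand-in for the list returned by h5read(..., compoundAsDataFrame = FALSE);
# the numbers are invented for illustration only.
tmat <- list(
  r = matrix(c(1, 2, 3, 4), nrow = 2),
  i = matrix(c(5, 6, 7, 8), nrow = 2)
)

# complex() returns a plain vector, so restore the matrix dimensions afterwards
z <- complex(real = tmat$r, imaginary = tmat$i)
dim(z) <- dim(tmat$r)
z
```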

I am unable to re-create this kind of dataset in a new file. I tried two strategies:

Creating sub-groups (fails -- I clearly don't understand what "COMPOUND" means for DATASET):

h5File <- 'test.h5'
rhdf5::h5createFile(h5File)
m <- matrix(1:9,3,3)
rhdf5::h5createDataset(h5File, 'tmatrix', dims = dim(m))
h5createGroup(h5File, 'tmatrix/r') # fails
h5write(Re(m), file=h5File, name="tmatrix")

or wrapping both matrices in a data.frame and using that method (which seems to produce COMPOUND datasets):

h5File <- 'test.h5'
rhdf5::h5createFile(h5File)
m <- matrix(1:9,3,3)
h5write(data.frame(r=I(Re(m)),i=I(Im(m))), file=h5File, name="tmatrix")

but then the matrices have been flattened to dimension 9:

HDF5 FILE 
        name /
    filename 

     name       otype   dclass dim
0 tmatrix H5I_DATASET COMPOUND 9

Many thanks for any advice!

sessionInfo()

R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.5

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Pacific/Auckland
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rhdf5_2.46.1

loaded via a namespace (and not attached):
[1] compiler_4.3.1      tools_4.3.1         fs_1.6.3            rstudioapi_0.15.0   rhdf5filters_1.15.1
[6] Rhdf5lib_1.24.0

PS: this is my first post on this forum, and I had no idea why my message was rejected (all red). Posting a smaller version and then editing it revealed that an en-dash was the culprit! Why aren't UTF-8 characters allowed?

Mike Smith ★ 6.5k

I did some work on this and you should now be able to read and write R's complex datatype following the conventions used in h5py and the other languages. This is only available in the devel version of rhdf5 (2.47.3).

library(rhdf5)
packageVersion('rhdf5')
#> [1] '2.47.3'

## create a complex matrix
mat <- matrix(complex(length.out = 30, real = 1:30, imaginary = 30:1), ncol = 10)
mat
#>       [,1]  [,2]  [,3]   [,4]   [,5]   [,6]   [,7]  [,8]  [,9] [,10]
#> [1,] 1+30i 4+27i 7+24i 10+21i 13+18i 16+15i 19+12i 22+9i 25+6i 28+3i
#> [2,] 2+29i 5+26i 8+23i 11+20i 14+17i 17+14i 20+11i 23+8i 26+5i 29+2i
#> [3,] 3+28i 6+25i 9+22i 12+19i 15+16i 18+13i 21+10i 24+7i 27+4i 30+1i

## create a file and write the complex matrix
h5file <- tempfile(fileext = '.h5')
h5createFile(h5file)
h5write(mat, h5file, '/test')

## check we have a compound dataset
h5ls(h5file)
#>   group name       otype   dclass    dim
#> 0     / test H5I_DATASET COMPOUND 3 x 10

## read it and we get a complex matrix back
res <- h5read(h5file, '/test')
res
#>       [,1]  [,2]  [,3]   [,4]   [,5]   [,6]   [,7]  [,8]  [,9] [,10]
#> [1,] 1+30i 4+27i 7+24i 10+21i 13+18i 16+15i 19+12i 22+9i 25+6i 28+3i
#> [2,] 2+29i 5+26i 8+23i 11+20i 14+17i 17+14i 20+11i 23+8i 26+5i 29+2i
#> [3,] 3+28i 6+25i 9+22i 12+19i 15+16i 18+13i 21+10i 24+7i 27+4i 30+1i
identical(res, mat)
#> [1] TRUE

If you're switching between languages, you might want to explore the native argument. Because HDF5 stores data in row-major order, rhdf5 transposes matrices during I/O operations. You can stop this behaviour by using native = TRUE, which will retain whatever orientation the data have in the file. If you have something generated in Python, or you want to feed it back to a Python program, then that might be useful.

## write the transpose matrix for interoperability with python
h5write(mat, h5file, '/test-native', native = TRUE)

## the new dataset is transposed
h5ls(h5file)
#>   group        name       otype   dclass    dim
#> 0     /        test H5I_DATASET COMPOUND 3 x 10
#> 1     / test-native H5I_DATASET COMPOUND 10 x 3

## now native = TRUE is required in R to get the original dimensions
h5read(h5file, '/test-native',  native = TRUE)
#>       [,1]  [,2]  [,3]   [,4]   [,5]   [,6]   [,7]  [,8]  [,9] [,10]
#> [1,] 1+30i 4+27i 7+24i 10+21i 13+18i 16+15i 19+12i 22+9i 25+6i 28+3i
#> [2,] 2+29i 5+26i 8+23i 11+20i 14+17i 17+14i 20+11i 23+8i 26+5i 29+2i
#> [3,] 3+28i 6+25i 9+22i 12+19i 15+16i 18+13i 21+10i 24+7i 27+4i 30+1i

If you run into more requirements feel free to ask here or open an issue at https://github.com/grimbough/rhdf5/issues


Oh, very nice! I did a quick try and it worked as advertised :) I'll report back when I have done more testing on the actual data. Many thanks

Mike Smith ★ 6.5k

As things stand, I don't think you can do this with rhdf5. To create a compound dataset you have to create your own datatype, but rhdf5 doesn't expose H5Tcreate() or H5Tinsert() to the end user. Both of those are needed to create a compound datatype. It uses them internally when writing data.frames but it makes some assumptions, as you've seen, about flattening those to 2-dimensional tables.
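
If the exact compound layout is not a hard requirement, one possible stopgap is to store the real and imaginary parts as two ordinary datasets inside a group. This is only a sketch under that assumption: the group name tmatrix and the dataset names r and i are chosen here to mirror the field names in the example file, and the result is a group holding two datasets rather than a COMPOUND dataset, so tools that expect the compound convention will not read it as complex.

```r
library(rhdf5)

# Temporary file so the sketch does not overwrite anything
h5file <- tempfile(fileext = ".h5")
h5createFile(h5file)

# Small complex matrix standing in for the real tmatrix
m <- matrix(complex(real = 1:9, imaginary = 9:1), nrow = 3)

# A group (not a compound dataset) holding the two parts separately
h5createGroup(h5file, "tmatrix")
h5write(Re(m), h5file, "tmatrix/r")
h5write(Im(m), h5file, "tmatrix/i")

# Read both parts back and rebuild the complex matrix
re <- h5read(h5file, "tmatrix/r")
im <- h5read(h5file, "tmatrix/i")
m2 <- re + 1i * im
```

Reading the file back needs the same two-step reassembly, which is the main drawback compared with a true compound dataset.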

I think the neatest solution might be for rhdf5 to handle R's complex type and write that directly into the format you've seen. This seems to be the approach taken by PyTables. Do you know what software or tool was used to create the original file?

I'll take a look at this over the weekend and try to report back here once I've implemented something.


Thanks, it's nice to hear confirmation. I'm pretty sure the file was created in Python with h5py, which follows the convention of storing complex arrays as compound data with r and i fields. I've switched to MATLAB in the meantime, and the "EasyH5" toolbox uses a similar convention to store complex arrays. I also looked at Julia's HDF5.jl, and it has adopted the same convention. Seems like it would be advantageous for R to follow that trend!

Many thanks!
