Reading HDF5 Files In The Cloud
@thomas-sandmann-6817

I would like to run the example code from the "Reading HDF5 Files In The Cloud" vignette of the rhdf5 Bioconductor package.

Executing the first example:

public_S3_url <- "https://rhdf5-public.s3.eu-central-1.amazonaws.com/h5ex_t_array.h5"
h5ls(file = public_S3_url, s3 = TRUE)

raises an error, because the Rhdf5lib package hasn't been compiled with support for S3:

Error in H5Pset_fapl_ros3(fapl, s3credentials) : 
  Rhdf5lib was not compiled with support for the S3 VFD

Does anybody have pointers on how to add support for S3 VFD?

This is what I found so far:

  • I found the vignette that documents how the authors of the Rhdf5lib library created their HDF5 distribution. The details are beyond my understanding, unfortunately. The authors also state:

This is for record keeping only, users of the Rhdf5lib package are not expected to follow any of the steps detailed here.

  • The HDF Group provides information on how to include the S3 VFD in HDF5, e.g. by adding arguments to the configure command. I tried to install the Rhdf5lib package from source and passed those arguments via the configure.args argument, but they weren't recognized:

BiocManager::install('Rhdf5lib', type = "source",
                     configure.args = "-DHDF5_ENABLE_ROS3_VFD:BOOL=ON")

Any suggestions would be appreciated - I'd love to understand how to Read HDF5 Files In The Cloud!

Thank you, Thomas

sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rhdf5_2.34.0

loaded via a namespace (and not attached):
 [1] compiler_4.0.3     credentials_1.3.0  tools_4.0.3        curl_4.3          
 [5] rhdf5filters_1.2.0 jsonlite_1.7.2     openssl_1.4.3      sys_3.4           
 [9] Rhdf5lib_1.12.1    askpass_1.1

I think I figured it out - based on your pointers to openssl!

openssl

I installed openssl through homebrew, which also pulls in curl.

brew install openssl

This by itself didn't fix the problem, i.e. the openssl headers were still not detected.

Symbolic links

I found this issue from an unrelated GitHub project, which recommends creating symbolic links:

sudo ln -s /usr/local/opt/openssl/include/openssl /usr/local/include/openssl
sudo ln -s /usr/local/opt/openssl/lib/* /usr/local/lib/
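
To confirm that the links took effect before rebuilding, a quick check along these lines may help. This is only a sketch: the function name has_openssl_headers is made up for illustration, and the three header names are simply the ones that show up in the configure output of the Rhdf5lib build.

```shell
# Hypothetical helper (not part of any package): succeed only if the three
# OpenSSL headers that the configure output tests for exist under the
# given include directory.
# $1 = include directory, e.g. /usr/local/include
has_openssl_headers() {
  for h in evp.h hmac.h sha.h; do
    # Stop at the first header that is missing.
    [ -f "$1/openssl/$h" ] || return 1
  done
}
```

It could be called as `has_openssl_headers /usr/local/include && echo "headers found"`. It only checks that the files exist, not that the compiler can actually use them.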

Rhdf5lib installation

Now the source installation of Rhdf5lib sets the S3_VFD=--enable-ros3-vfd argument automatically, as expected:

[truncated]
checking if the direct I/O virtual file driver (VFD) is enabled... no
checking curl/curl.h usability... yes
checking curl/curl.h presence... yes
checking for curl/curl.h... yes
checking openssl/evp.h usability... yes
checking openssl/evp.h presence... yes
checking for openssl/evp.h... yes
checking openssl/hmac.h usability... yes
checking openssl/hmac.h presence... yes
checking for openssl/hmac.h... yes
checking openssl/sha.h usability... yes
checking openssl/sha.h presence... yes
checking for openssl/sha.h... yes
checking for curl_global_init in -lcurl... yes
checking for EVP_sha256 in -lcrypto... yes
checking if the Read-Only S3 virtual file driver (VFD) is enabled... yes
[truncated]

The installation finishes successfully.

rhdf5 installation

I reinstalled the rhdf5 package as well (because my original installation still didn't pick up the S3 VFD).

BiocManager::install("rhdf5", type = "source", update = FALSE)

And now I can read your example file!

library(rhdf5)
public_S3_url <- "https://rhdf5-public.s3.eu-central-1.amazonaws.com/h5ex_t_array.h5"
h5ls(file = public_S3_url, s3 = TRUE)
  group name       otype dclass dim
0     /  DS1 H5I_DATASET  ARRAY   4

Thanks a lot for your help, much appreciated! Thomas


Excellent, great that it seems to be working. On my GitHub builder I settled on adding the following two lines to $HOME/.R/Makevars:

LDFLAGS="-L/usr/local/opt/openssl@1.1/lib"
CPPFLAGS="-I/usr/local/opt/openssl@1.1/include"
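
For anyone who prefers to script this, here is a minimal sketch that appends the same two lines from the shell. The function name add_openssl_flags and its two arguments are assumptions for illustration only; the OpenSSL prefix is passed in rather than detected, since it differs between systems.

```shell
# Hypothetical helper: append OpenSSL search paths to an R Makevars file.
# $1 = OpenSSL install prefix, e.g. /usr/local/opt/openssl@1.1
# $2 = path to the Makevars file (defaults to $HOME/.R/Makevars)
add_openssl_flags() {
  prefix=$1
  makevars=${2:-"$HOME/.R/Makevars"}
  # Create the .R directory if it does not exist yet.
  mkdir -p "$(dirname "$makevars")"
  # Append rather than overwrite, so existing settings are kept.
  {
    printf 'LDFLAGS="-L%s/lib"\n' "$prefix"
    printf 'CPPFLAGS="-I%s/include"\n' "$prefix"
  } >> "$makevars"
}
```

Calling `add_openssl_flags /usr/local/opt/openssl@1.1` would reproduce the two lines above. If the Makevars file already sets LDFLAGS or CPPFLAGS, the appended assignments come later in the file and so take precedence, which may or may not be what you want.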

Now that it's working, I should point out that in my limited testing with larger, real-data files, h5ls() is surprisingly slow, but h5read() with an index argument seems to work quite well if you already know the structure of the file. This performance is something I'm actively working on at the moment.


Also, do you want to drag your solution above to be an Answer rather than a Comment? Hopefully that'll help someone else looking for the info in the future.


Do you install the source package or the built binary? If you're not sure, does Rhdf5lib print hundreds of lines to the screen during installation? If so you're installing from source.

The real answer is that we need to make sure it finds libcurl & libopenssl system libraries during compilation, but you seem to have R packages built around those available, so I assume they're installed on the system.

I've not thought about whether the Mac binaries would ship with support, so it would be good to know how you are installing.


Thanks a lot for your super quick reply, Mike!

I have tried installing both the Mac binary

library(BiocManager)
install("Rhdf5lib", update = FALSE)
# retrieves https://bioconductor.org/packages/3.12/bioc/bin/macosx/contrib/4.0/Rhdf5lib_1.12.1.tgz

and the source package

install("Rhdf5lib", type = "source", update = FALSE)
# retrieves https://bioconductor.org/packages/3.12/bioc/src/contrib/Rhdf5lib_1.12.1.tar.gz

but got the "Rhdf5lib was not compiled with support for the S3 VFD" error either way.

I also tried to add configuration arguments to the call, but the configure command did not recognize them:

install("Rhdf5lib", type = "source", update = FALSE, configure.args = c("Rhdf5lib" = "--enable-ros3-vfd"))
configure: WARNING: unrecognized options: --enable-ros3-vfd

or

install("Rhdf5lib", type = "source", update = FALSE, configure.args = c("Rhdf5lib" = "-DHDF5_ENABLE_ROS3_VFD:BOOL=ON"))
error: unrecognized option: `-DHDF5_ENABLE_ROS3_VFD:BOOL=ON'

Is that what you needed to know?

Thanks, Thomas

P.S.: After you pointed out that this might be Mac specific, I tried a BioC docker container running Linux. I installed rhdf5 (from source) and was able to access the remote HDF5 file. So it seems you are right, the Mac version might be missing the S3 support.


Thanks for the info. You won't be able to enable this via configure.args. Rhdf5lib has its own configure file that wraps the HDF5 configuration. It's a simplified version to control many of the options. That makes it easier for me to support the majority of users, but also means a lot of the general HDF5 documentation isn't appropriate.

I took a look at the Bioconductor log from when the build system creates the Mac binary. Fairly close to the top you can see the message S3_VFD=--enable-ros3-vfd=no. That's the argument that will be passed to HDF5 during compilation. Rhdf5lib selects it based on the availability of libcurl and libopenssl, so the only way to change it to a yes is to make sure those can be found.

You can see the results of the individual tests on the lines above e.g.

checking curl/curl.h usability... yes
checking curl/curl.h presence... yes
checking for curl/curl.h... yes
checking openssl/evp.h usability... no
checking openssl/evp.h presence... no
checking for openssl/evp.h... no
checking openssl/hmac.h usability... no
checking openssl/hmac.h presence... no
checking for openssl/hmac.h... no
checking openssl/sha.h usability... no
checking openssl/sha.h presence... no
checking for openssl/sha.h... no

This suggests that openssl isn't installed, or at least can't be found. I don't know enough about the Bioconductor build machines to say which is the case, but it explains why the binary version doesn't have S3 support. I'm also not sure how portable it would be if the build system did have it, but a user did not.
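
If you save or pipe the installation output, a small filter like the following can pull out the headers that configure failed to find. It is a sketch only: missing_headers is a made-up name, and it assumes the log lines have exactly the "checking for <header>... no" form shown above.

```shell
# Hypothetical helper: list headers that configure could not find.
# Reads configure output on stdin and prints one header name per line,
# based on the final "checking for <header>... no" line of each test.
missing_headers() {
  sed -n 's/^checking for \(.*\.h\)\.\.\. no$/\1/p'
}
```

For example, `R CMD INSTALL Rhdf5lib_1.12.1.tar.gz 2>&1 | missing_headers` should print the openssl headers for a log like the one above, and nothing if every header check succeeded.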

I see the same on my GitHub Actions Mac builder. There I explicitly install openssl with brew install openssl, so that clearly isn't sufficient to get this working. I'm not really a Mac user, but maybe this gives you some tips on what might be needed to get it working on your system. I'll try to get the version on GitHub working, and will report back if I find the right instructions.


Many thanks for your detailed explanation - and for pointing me in the right direction. I will report back if I can figure out how to make it work on my system. (And if you'd like a tester - beyond your github setup - let me know!)

Thanks, Thomas
