Modifying compound datasets
@harbydonald-14723

In my workflow I have large h5 files with many compound datasets, and I normally have to modify a few values in the data.  The structure and names in the h5 file cannot be changed, so I only need to modify the data.  I could create new h5 files, write the modified data, then copy all the rest of the data across, but this seems like such a waste.  Another option would be to delete or unlink the datasets, but I don't think there is a way to do that in R.  It looks like H5Ldelete or H5Gunlink is not implemented in rhdf5?  Maybe there is some other simple solution that I am overlooking.  Below is some simple code that shows my problem.

library(rhdf5)

(h5fl <- tempfile(fileext=".h5"))
h5createFile(file=h5fl)

df <- data.frame(a=1:4, b=c(1.1, 2.1, 3.1, 4.1), d=42:45)
# create a modified data frame
df_mod <- data.frame(a=1:4, b=c(5.1, 2.1, 3.1, 4.1), d=42:45)

# write the compound data
h5write(df, h5fl, "dfcompound")
# how to modify the compound data?  This gives an error?
h5write(df_mod, h5fl, "dfcompound")
H5close()

h5ls(h5fl)
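
To make "a few values" concrete, here is a minimal base R sketch (no rhdf5 needed) that locates the cells that differ between the two data frames above. The name `changed` is only illustrative.

```r
df     <- data.frame(a = 1:4, b = c(1.1, 2.1, 3.1, 4.1), d = 42:45)
df_mod <- data.frame(a = 1:4, b = c(5.1, 2.1, 3.1, 4.1), d = 42:45)

# Element-wise comparison gives a logical matrix; arr.ind = TRUE
# returns the row/column position of every differing cell.
changed <- which(df != df_mod, arr.ind = TRUE)
print(changed)

# Column names of the fields that would need rewriting
print(names(df)[unique(changed[, "col"])])
```

In this example only one cell (row 1 of column `b`) differs, yet the whole compound dataset has to be rewritten.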

Mike Smith
@mike-smith
EMBL Heidelberg / de.NBI

You're correct that H5Ldelete is not available in the current version of rhdf5, and I don't think there's a straightforward way to do this with the current package version. For data types other than compound you're able to overwrite existing entries, so I guess this hasn't come up before.
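
For comparison, this is roughly what overwriting looks like for a non-compound dataset. It is a minimal sketch, assuming a recent rhdf5 where `h5write()` accepts an `index` argument and `h5closeAll()` is available; the dataset name "m" is made up for illustration.

```r
library(rhdf5)

h5fl <- tempfile(fileext = ".h5")
h5createFile(h5fl)

# A plain (non-compound) integer matrix
m <- matrix(1:12, nrow = 3, ncol = 4)
h5write(m, h5fl, "m")

# Overwrite only column 2 in place.
# index = list(rows, cols); NULL selects everything along that dimension.
h5write(matrix(c(100L, 200L, 300L), nrow = 3, ncol = 1), h5fl, "m",
        index = list(NULL, 2))

m_new <- h5read(h5fl, "m")
print(m_new)

h5closeAll()
```

The same pattern applied to a compound dataset is what fails in the original question.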

Given this, I've added an implementation of H5Ldelete() and a higher-level h5delete() to the development version of the package. You can install this from GitHub using:

BiocInstaller::biocLite('grimbough/Rhdf5lib')
BiocInstaller::biocLite('grimbough/rhdf5', ref = 'H5Ldelete')


Now we can test with your example:

library(rhdf5)
h5fl <- tempfile(fileext=".h5")
h5createFile(file=h5fl)
df <- data.frame(a=1:4, b=c(1.1, 2.1, 3.1, 4.1), d=42:45)
df_mod <- data.frame(a=1:4, b=c(5.1, 2.1, 3.1, 4.1), d=42:45)

h5write(df, h5fl, "dfcompound")
h5read(h5fl, "dfcompound")

  a   b  d
1 1 1.1 42
2 2 2.1 43
3 3 3.1 44
4 4 4.1 45


Now use h5delete() to remove the "dfcompound" dataset and verify it doesn't exist.

h5delete(file = h5fl, name = "dfcompound")
h5read(h5fl, "dfcompound")

Error in h5read(h5fl, "dfcompound") :
Object 'dfcompound' does not exist in this HDF5 file.


It's now possible to write a new dataset with the same name.

h5write(df_mod, h5fl, "dfcompound")
h5read(h5fl, "dfcompound")

  a   b  d
1 1 5.1 42
2 2 2.1 43
3 3 3.1 44
4 4 4.1 45


It would be nice to only overwrite the subset of the dataset that's changed, but I don't know why the original rhdf5 maintainer prevented this - there may be a technical limitation in HDF5 that I'm not aware of for compound datasets. For now hopefully this is sufficient for your needs.
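
Until subset overwriting is possible, the read, modify, delete, rewrite cycle can be wrapped up. The following is only a sketch, assuming a version of rhdf5 that provides `h5delete()`; `replace_compound()` and its `update` argument are hypothetical names, not part of the package.

```r
library(rhdf5)

# Hypothetical helper: rewrite a whole compound dataset after applying
# `update` (a function taking and returning a data.frame) to its contents.
replace_compound <- function(file, name, update) {
  current  <- h5read(file, name)       # compound dataset is read as a data.frame
  modified <- update(current)
  h5delete(file = file, name = name)   # unlink the old dataset
  h5write(modified, file, name)        # write the replacement under the same name
  invisible(modified)
}

h5fl <- tempfile(fileext = ".h5")
h5createFile(h5fl)
h5write(data.frame(a = 1:4, b = c(1.1, 2.1, 3.1, 4.1), d = 42:45),
        h5fl, "dfcompound")

# Change a single value in column b
replace_compound(h5fl, "dfcompound", function(df) {
  df$b[1] <- 5.1
  df
})

result <- h5read(h5fl, "dfcompound")
print(result)
```

Two caveats with this approach: HDF5 generally does not return the space of a deleted dataset to the file system, so the file may grow with repeated replacements, and any attributes attached to the old dataset are not carried over by this sketch.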

Please let me know if you experience any unexpected issues with it, and if it seems stable I'll incorporate it into the main branch of rhdf5.

On a tangential note, I really don't recommend running H5close() with this version of the package, as its behaviour is likely to break everything. There's now the h5closeAll() function that achieves the same goal, although in an ideal world you wouldn't have to use it at all. If you find you get lots of messages about files already being open please let me know; I'm trying to stop that happening.
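
As an illustration of the point about closing, here is a minimal sketch using the low-level interface, where each handle that is opened is closed explicitly and h5closeAll() is only a fallback. The dataset name "x" is made up for the example.

```r
library(rhdf5)

h5fl <- tempfile(fileext = ".h5")
h5createFile(h5fl)
h5write(1:10, h5fl, "x")

# Open handles explicitly ...
fid <- H5Fopen(h5fl)
did <- H5Dopen(fid, "x")
x   <- H5Dread(did)

# ... and close each one once it is no longer needed
H5Dclose(did)
H5Fclose(fid)

# Fallback: close any handles that are still open
h5closeAll()

print(x)
```

The high-level functions h5read() and h5write() open and close the file themselves when given a file name, so explicit closing is mainly relevant when working with handles like `fid` above.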

@harbydonald-14723

Thanks for the quick reply and updated code.

Any idea when this will be added to the released version?  I tried to load the development version but I am not exactly sure how to use GitHub.  The development version didn't seem to work with my version of R.


My plan was to let you test the new version of the code, then move it into the release version.  You should be able to install from GitHub by running these two lines:

BiocInstaller::biocLite('grimbough/Rhdf5lib')
BiocInstaller::biocLite('grimbough/rhdf5', ref = 'H5Ldelete')

If you get an error message, report it back here and we'll work through it.

@harbydonald-14723

printdatatype.o:printdatatype.c:(.text+0xb65): undefined reference to `H5open'
printdatatype.o:printdatatype.c:(.text+0xb6a): undefined reference to `H5T_NATIVE_LLONG_g'
printdatatype.o:printdatatype.c:(.text+0xb76): undefined reference to `H5Tequal'
printdatatype.o:printdatatype.c:(.text+0xb89): undefined reference to `H5open'
printdatatype.o:printdatatype.c:(.text+0xb8e): undefined reference to `H5T_NATIVE_ULLONG_g'
printdatatype.o:printdatatype.c:(.text+0xb9a): undefined reference to `H5Tequal'
printdatatype.o:printdatatype.c:(.text+0xbb0): undefined reference to `H5Tget_size'
printdatatype.o:printdatatype.c:(.text+0xbbd): undefined reference to `H5Tget_order'
printdatatype.o:printdatatype.c:(.text+0xbc5): undefined reference to `H5Tget_sign'
printdatatype.o:printdatatype.c:(.text+0xc75): undefined reference to `H5Tget_class'
collect2.exe: error: ld returned 1 exit status
no DLL was created
ERROR: compilation failed for package 'rhdf5'
* removing 'C:/Program Files/R/R-3.3.3/library/rhdf5'
* restoring previous 'C:/Program Files/R/R-3.3.3/library/rhdf5'
Warning in file.copy(lp, dirname(pkgdir), recursive = TRUE, copy.date = TRUE) :
problem copying <C:\Program Files\R\R-3.3.3\library\00LOCK-grimbough-rhdf5-17605f5\rhdf5\libs\x64\libhdf5ForBioC-7.dll> to <C:\Program Files\R\R-3.3.3\library\rhdf5\libs\x64\libhdf5ForBioC-7.dll>: Permission denied
Warning in file.copy(lp, dirname(pkgdir), recursive = TRUE, copy.date = TRUE) :
problem copying <C:\Program Files\R\R-3.3.3\library\00LOCK-grimbough-rhdf5-17605f5\rhdf5\libs\x64\libsz-2.dll> to <C:\Program Files\R\R-3.3.3\library\rhdf5\libs\x64\libsz-2.dll>: Permission denied
Warning in file.copy(lp, dirname(pkgdir), recursive = TRUE, copy.date = TRUE) :
problem copying <C:\Program Files\R\R-3.3.3\library\00LOCK-grimbough-rhdf5-17605f5\rhdf5\libs\x64\rhdf5.dll> to <C:\Program Files\R\R-3.3.3\library\rhdf5\libs\x64\rhdf5.dll>: Permission denied
Installation failed: Command failed (1)


I think there are a few separate issues here.  First, it looks like the build isn't able to find the version of HDF5 distributed with Rhdf5lib (that's the 'undefined reference' errors), so package installation fails. Then it tries to put the existing version of rhdf5 back and runs into permissions problems.

First off, I would download and install R version 3.4.3 (https://cran.r-project.org/bin/windows/base/R-3.4.3-win.exe).  You're using a version of R that is out-of-date, and since Bioconductor versions are generally tied to specific R versions, most of your packages will be out-of-date too.
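
A quick way to check what is currently installed before and after upgrading is sketched below in base R. The threshold 3.4.3 is simply the version suggested above, and `r_ok` is an illustrative name.

```r
# Version of the running R session
print(R.version.string)

# TRUE if this session is at least the suggested version
r_ok <- getRversion() >= "3.4.3"
print(r_ok)

# Installed rhdf5 version, if the package is present at all
if (requireNamespace("rhdf5", quietly = TRUE)) {
  print(packageVersion("rhdf5"))
}
```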

I would then try to install the packages in this fresh version of R using the following.

source("http://www.bioconductor.org/biocLite.R")
biocLite('devtools')
biocLite('grimbough/Rhdf5lib')
biocLite('grimbough/rhdf5', ref = 'H5Ldelete')

When you do this, make sure you're running R as a regular user, not an Administrator.  If you install packages as an admin they tend to get put somewhere like C:\Program Files\R\R-3.4.3\library, which a regular user then can't write to, and you get 'Permission denied' errors like the ones above.  If you run R as a regular user it will normally ask if you want to use a 'Personal Library' and put this under Documents\R\R-3.4.3\.
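
To see which library locations R will use and whether the current user can write to them, something like the following base R sketch can help. The variable names are illustrative only.

```r
# All library locations known to this session, in search order
lib_paths <- .libPaths()

# file.access() returns 0 when the requested access (mode = 2: write) is allowed
writable <- file.access(lib_paths, mode = 2) == 0
print(data.frame(path = lib_paths, writable = writable))

# Location R would use for a personal library (may not exist yet)
user_lib <- Sys.getenv("R_LIBS_USER")
print(user_lib)
```

If the first entry is not writable, package installation into it is likely to fail with 'Permission denied' errors of the kind shown earlier in the thread.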

Then report back if it's still throwing errors and I'll have to dig more deeply into Windows debugging.

@harbydonald-14723

The first line ran fine.

The error was generated by the second line:

BiocInstaller::biocLite('grimbough/rhdf5', ref = 'H5Ldelete')

@harbydonald-14723

Got it working, thanks!