The issue here is at the time of this writing rhdf5 treats data.frames specially (writing as COMPOUND type by default) and COMPOUND data types are opaque to rhdf5 (again, at time of this writing). See compoundAsDataFrame and DataFrameAsCompound in ?h5write. Compound types are all-or-nothing; subsetting not supported. See post by Bernd, maintainer of rhdf5: C: Reading by column . To illustrate, try this in an R session:
library(rhdf5)
(h5fl <- tempfile(fileext=".h5"))
h5createFile(file=h5fl)
matr <- matrix(1:12, nrow=4) ## control: 2D obj
h5write(matr, h5fl, "matr")
df <- data.frame(a=1:4, b=c(1.1, 2.1, 3.1, 4.1), d=42:45)
h5write(df, h5fl, "dfcompound")
h5write(df, h5fl, "dfsep", DataFrameAsCompound=FALSE)
H5close()
h5ls(h5fl) ## matr dim: 4 x 3 (known, transparent); dfcompound dim: 3 (opaque)
## issues warning (but not error), but guesses it's doing the right
## thing because length(index) matches dimensional extent of
## dfcompound reported by h5ls
h5read(h5fl, "dfcompound", index=list(2))
h5read(h5fl, "dfsep", index=list(2:3)) ## also wrong
The solution is straightforward; just create a function that manages the fact the columns are written to separate datasets in vector-like fashion:
## allColNames: character(), names of *all* the data.frame's columns
## (in original order), not just those selected
readCompound <- function(file, name, allColNames, index=NULL, ...) {
rowSubset <- NULL
colnms <- allColNames
if( !is.null(index) ) {
if( !is.null(index[[1]]) )
rowSubset <- index[[1]]
if( !is.null(index[[2]]) )
colnms <- allColNames[index[[2]]]
}
coldatasets <- paste(name, colnms, sep="/")
df <- lapply(coldatasets, function(x) {
h5read(file, x, index=list(rowSubset), ...)
})
names(df) <- colnms
as.data.frame(df)
}
(res <- readCompound(h5fl, "dfsep", names(df)))
(res <- readCompound(h5fl, "dfsep", names(df), index=list(NULL, 1)))
(res <- readCompound(h5fl, "dfsep", names(df), index=list(2:4, NULL)))
(res <- readCompound(h5fl, "dfsep", names(df), index=list(2:4, 2:3)))
Should be easy to extrapolate writing to such a dataset.
As for adding rows to a data.frame, I suggest creating the datasets independently, but in a way that mirrors the effect of DataFrameAsCompound=FALSE. And perhaps take advantage of creating a dataset with room to grow (via maxdims). But hdf5 represents repeated values pretty efficiently, so growing might not be necessary for your use case.
createDFDataset <- function(file, datasetnm, df, dims=dim(df), maxdims=dim(df)) {
coldatasetnms <- paste(datasetnm, names(df), sep="/")
stormodes <- vapply(df, storage.mode, character(1))
h5createGroup(file, datasetnm)
success <- vapply(seq_along(df), function(x) {
h5createDataset(file, coldatasetnms[x], dims, maxdims,
storage.mode=stormodes[x])
}, logical(1))
if(all(success)) TRUE else names(df)[which(!success)]
}
## maxdims with room to grow
(res <- createDFDataset(h5fl, "thedf", df, c(2, 3), c(8, 3)))
H5close()
h5ls(h5fl, all=TRUE)
Here's a quick illustration of growing a dataset and writing to it:
library(rhdf5)
vec <- sample(1:99, 10000, replace=TRUE)
(h5grow <- tempfile(fileext=".h5"))
h5createFile(h5grow)
h5createDataset(h5grow, "vec", 5000, 10000, storage.mode="integer")
h5write(vec[1:5000], h5grow, "vec")
h5set_extent(h5grow, "vec", 10000)
h5write(vec[5001:length(vec)], h5grow, "vec", index=list(5001:length(vec)))
Let me know if you have any questions.
-Nate