Dataset Column Names in rhdf5
kpalmer ▴ 10

Hey all, simple question that I've been stuck on. I'm preallocating my dataset so that I can fill it iteratively as I process new data. What I haven't been able to work out is how to define column names after the dataset is created. It appears possible when you write a data frame, but not a matrix.

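library(rhdf5)

# Create the file and parent group first (skip if they already exist)
h5createFile('proj.h5')
h5createGroup('proj.h5', 'instrumen1')

# Create a preallocated dataset using rhdf5
h5createDataset(file = 'proj.h5',
                dataset = 'instrumen1/Metrics2',
                dims = c(500, 2000),
                chunk = c(50, 2000),
                storage.mode = "double",
                fillValue = NaN)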

# Once next batch of data are processed (newData) write it to the block
newData = matrix(0, 3,2000)
dataStart = 1 
columnNames = as.character(10:2009)  # one name per column (2000 in total)

# Write new data to existing database
h5write(
    newData,
    file = 'proj.h5',
    name = 'instrumen1/Metrics2',
    start = c(dataStart, 1),
    count = c(nrow(newData), ncol(newData))
    # write.options = list(colnames = TRUE)  - this doesn't work
)

sessionInfo( )

I've been able to get it to work using the hdf5r package, which stores the names as an attribute, so perhaps column names could be written with h5writeAttribute? I haven't been able to sort out the syntax. Thanks again!

Mike Smith ★ 6.5k

Currently rhdf5 doesn't have a mechanism for storing dim names as attributes of an HDF5 dataset. That's because the row/col/slice etc names are stored as a list in R called dimnames, which has the same length as the number of dimensions, but no names. rhdf5 isn't able to write such a list as an HDF5 attribute, so this won't work transparently at the moment.

Since you're creating this custom data structure inside the HDF5 file for your specific data, one approach is to also create two additional datasets called rownames and colnames. When you want to extract data, you would then read the appropriate entries both from your actual data and the dimension name datasets. Then set the names in R using colnames() etc.
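
As a rough sketch of the separate-dataset approach, assuming a small matrix and made-up names (the temporary file, the "metrics" and "colnames" datasets, and the "freq_" labels are all hypothetical, not taken from the question), it could look something like this:

```r
library(rhdf5)

# Hypothetical file and dataset names, chosen for illustration only
h5file <- tempfile(fileext = ".h5")
h5createFile(h5file)

# Preallocate the data matrix, as in the question (much smaller here)
h5createDataset(h5file, "metrics", dims = c(10, 5),
                storage.mode = "double", fillValue = NaN)

# Store the column names as their own 1-D character dataset
colNames <- paste0("freq_", 1:5)
h5write(colNames, h5file, "colnames")

# Write a batch of rows into the preallocated block
newData <- matrix(as.numeric(1:15), nrow = 3, ncol = 5)
h5write(newData, h5file, "metrics",
        start = c(1, 1), count = dim(newData))

# Read rows 1 to 3 back and reattach the names in R
batch <- h5read(h5file, "metrics", index = list(1:3, NULL))
colnames(batch) <- as.character(h5read(h5file, "colnames"))
h5closeAll()
```

Row names could be handled the same way with a second character dataset, read with the same row index as the data.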

You might also find some mileage in the HDF5Array package, which IIRC has some dim name functionality available. It might even make sense to use that interface directly, rather than rhdf5, since it seems you're primarily working with large matrices.
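
A minimal sketch of the HDF5Array route, assuming its writeHDF5Array() function with the with.dimnames argument and a made-up dataset name ("metrics") and column labels. Note that this writes a whole in-memory matrix in one go, so it may not suit the preallocate-and-fill workflow from the question without further work:

```r
library(HDF5Array)

# Small in-memory matrix with column names (hypothetical labels)
m <- matrix(runif(12), nrow = 3,
            dimnames = list(NULL, c("a", "b", "c", "d")))

# Write the matrix and its dimnames to a temporary HDF5 file
h5file <- tempfile(fileext = ".h5")
writeHDF5Array(m, filepath = h5file, name = "metrics",
               with.dimnames = TRUE)

# Reopen the dataset; the stored dimnames should come back with it
M <- HDF5Array(h5file, "metrics")
colnames(M)
```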

Thanks Mike!