Question

pytables to R

sarfraz • 0

@sarfraz-7298

United Kingdom

Hello,

First of all, apologies if this is not the right place for this type of question, but I am stuck and would really appreciate any help or pointers.

I am trying to use rhdf5 to read a subset of a large dataset that was originally created with pytables and stored in an HDF5 file. As the dataset is quite large, I only want to read a few rows of the table at a time (say three, as in the following example), but I am running into issues.

library('rhdf5')
library(bit64)

h5ls('LargeDataset.h5', recursive=2)
     group        name       otype   dclass     dim
0        /     ESLarge   H5I_GROUP                 
1 /ESLarge _i_features   H5I_GROUP                 
2 /ESLarge    features H5I_DATASET COMPOUND 4327078

data <- h5read('LargeDataset.h5', 'ESLarge/features', index=list(1:3), bit64conversion='bit64')
Warning message:
In `[.data.frame`(list(a_inferred = c("unknown", "unknown",  :
  'drop' argument will be ignored

I know that the table consists of 4327078 rows and 11 columns, so for the call above the data variable should contain 3 rows and 11 columns. When I look at data, however, I see only 3 rows and 3 columns.

data
  a_inferred a_label     h_mean
1    unknown unknown 0.14034226
2    unknown unknown 0.05577267
3    unknown unknown 0.03498855

Can someone suggest how I can read a few rows with all the columns, please? Changing the argument of list gives me a square result of a different size, e.g. list(1:6) gives a 6x6 data variable. Adding a second index gives an error:

data <- h5read('LargeDataset.h5', 'ESLarge/features', index=list(1:3, NULL), bit64conversion='bit64')
Error in h5read("LargeDataset.h5", "ESLarge/features",  :
  length of index has to be equal to dimensional extension of HDF5 dataset.

Any ideas please?

score 0 · Answer 1 · 2015-01-28

Have you thought about using the rpy2 module? Have a look at http://rpy.sourceforge.net/

Inside the Python environment you can read the desired chunk of your HDF5 file and write it to a file as an R object with rpy2. The pandas Python module also has some methods for dealing with HDF5 files. I've never done this myself, so perhaps the experts can suggest a better solution.

Best.

score 0 · Answer 2 · 2015-01-28

If I've diagnosed this correctly, the problem is that reading subsets of the COMPOUND HDF5 type is not currently supported by rhdf5. See C: Reading by column, a response from the rhdf5 maintainer to a similar question. I don't know pytables, but if possible, try re-encoding the data as one or several matrices. I would also contact the rhdf5 maintainer to ask about support for subsetting the COMPOUND type.
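As a rough sketch of what re-encoding could look like on the R side, the example below stores each column as its own atomic 1-D dataset and then reads the same rows from every column. The group name "cols", the toy data frame, and the variable names are made up for illustration; this is not something pytables produces, and a pytables Table query would not see data laid out this way.

```r
library(rhdf5)

fl <- tempfile(fileext = ".h5")
h5createFile(fl)
h5createGroup(fl, "cols")   # hypothetical group holding one dataset per column

# Toy stand-in for the real table
df <- data.frame(a = 1:6, b = letters[1:6], stringsAsFactors = FALSE)

# Write each column as an atomic 1-D dataset instead of one COMPOUND dataset
for (nm in names(df)) {
  h5write(df[[nm]], fl, paste0("cols/", nm))
}

# Read only rows 2..4 from every column and reassemble a data frame
rows <- 2:4
cols <- lapply(names(df), function(nm) {
  as.vector(h5read(fl, paste0("cols/", nm), index = list(rows)))
})
sub <- as.data.frame(setNames(cols, names(df)), stringsAsFactors = FALSE)
sub
```

Each dataset is one-dimensional here, so the index list has exactly one element, which matches what h5read expects for an atomic type.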

You can see in the output of h5ls that rhdf5 doesn't know the column structure of the data; it only sees the number of rows. Consider:

> library(rhdf5)
> fl = tempfile()
> h5createFile(fl)
[1] TRUE
> m = matrix(1:12, 6)
> df = data.frame(a=1:6, b=letters[1:6]) ## data.frames encoded as COMPOUND by default
> h5write(m, fl, "m")
> h5write(df, fl, "df")
> h5ls(fl)
  group name       otype   dclass   dim
0     /   df H5I_DATASET COMPOUND     6
1     /    m H5I_DATASET  INTEGER 6 x 2
> h5read(fl, "m", index=list(2:4, NULL)) ## appropriate for atomic type with dimensions
     [,1] [,2]
[1,]    2    8
[2,]    3    9
[3,]    4   10
> h5read(fl, "df", index=list(2:4, NULL)) ## mismatch
Error in h5read(fl, "df", index = list(2:4, NULL)) :
  length of index has to be equal to dimensional extension of HDF5 dataset.
> H5close()
> h5read(fl, "df", index=list(2:4)) ## simply breaks
Error in `[.data.frame`(list(a = 2:4, b = 2:4), 1:3, drop = FALSE) : 
  undefined columns selected
In addition: Warning message:
In `[.data.frame`(list(a = 2:4, b = 2:4), 1:3, drop = FALSE) :
  'drop' argument will be ignored
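The warning in both transcripts hints at why the result in the question comes out square. I have not checked the rhdf5 source, so treat this as an assumption: it looks as if the row index is applied to the resulting data frame as a single index together with drop = FALSE, and base R treats a single index on a data frame as a column selection. The base R sketch below reproduces that behaviour with made-up data frames; no HDF5 file is involved.

```r
# Hypothetical stand-in: 3 rows already read from a table with 11 fields
rows3 <- as.data.frame(matrix(seq_len(3 * 11), nrow = 3))

# A single index plus drop = FALSE selects columns 1..3 (and warns about drop)
sub3 <- suppressWarnings(rows3[1:3, drop = FALSE])
dim(sub3)

# Six rows indexed with 1:6 give six columns, hence the 6x6 result
rows6 <- as.data.frame(matrix(seq_len(6 * 11), nrow = 6))
sub6 <- suppressWarnings(rows6[1:6, drop = FALSE])
dim(sub6)

# With only two columns, asking for columns 1:3 fails, as in the transcript above
two_cols <- data.frame(a = 2:4, b = 2:4)
err <- tryCatch(
  suppressWarnings(two_cols[1:3, drop = FALSE]),
  error = function(e) conditionMessage(e)
)
err
```

If that reading is right, the rows requested are the ones returned, but the columns are then truncated to the same count, which would explain both the 3x3 result and the "undefined columns selected" error.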

score 0 · Answer 3 · 2015-01-29

Thank you for the answers and the pointers.

Yes, it appears that the HDF5 compound type is causing the issues. Unfortunately, even if I create a table of all homogeneous types (say floats) in pytables, it still stores the data as a compound type. The following link on the pytables file format suggests the same:

https://pytables.github.io/usersguide/file_format.html#table-format

My aim was to read this dataset row by row, do some number crunching in R, and update a couple of columns of each row based on the results. Then I could run some select queries using pytables to explore the results.
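For reference, a minimal sketch of the read, compute, update loop in blocks of rows rather than one row at a time. It uses an in-memory vector as a stand-in for an on-disk column, and chunk_bounds, score, and the doubling step are all hypothetical placeholders. With rhdf5 this would only apply to a column stored as an atomic dataset, where the block could be read with h5read(..., index = list(rows)).

```r
# Split n_rows into consecutive (start, end) blocks of at most chunk_size rows
chunk_bounds <- function(n_rows, chunk_size) {
  starts <- seq(1, n_rows, by = chunk_size)
  data.frame(start = starts, end = pmin(starts + chunk_size - 1, n_rows))
}

# Hypothetical in-memory stand-in for one on-disk column (h_mean in the question)
h_mean <- c(0.14034226, 0.05577267, 0.03498855, 0.2, 0.4, 0.6, 0.8)
score <- numeric(length(h_mean))   # hypothetical column to be updated

bounds <- chunk_bounds(length(h_mean), chunk_size = 3)
for (k in seq_len(nrow(bounds))) {
  rows <- bounds$start[k]:bounds$end[k]
  block <- h_mean[rows]      # stand-in for reading these rows from the file
  score[rows] <- block * 2   # placeholder for the real number crunching
}

# For the table in the question, blocks of 100000 rows would need this many passes
nrow(chunk_bounds(4327078, 100000))
```

Working in blocks keeps memory use bounded while avoiding one file access per row.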