Question

Querying a HDF5 database from R

0

Entering edit mode

Guest User ★ 13k

@guest-user-4897

Last seen 9.6 years ago

Hello, I have a database (22 GB) in SQLite that I query from R for numerical analysis. I'm considering converting the database to HDF5 for faster read times (because reading the population is slow). I have two questions about the rhdf5 package that I haven't been able to figure out from my own experimenting. (i) Suppose that I save an R dataframe to a HDF file. Is it possible to read subsets of the dataframe based on variable names and variable values? Often, I don't won't to read the full dataframe into memory (~ 100 million observations and ~ 30 variables). (ii) I frequently use indexes in my SQLite database to quickly join related tables. Does rhdf5 have a similar feature? If not, will converting to a HDF5 database create substantial bottlenecks if I rely on these joins frequently? Thanks so much for your help. -- output of sessionInfo(): N/A -- Sent via the guest posting facility at bioconductor.org.

rhdf5 rhdf5 • 1.6k views

ADD COMMENT • link updated 10.3 years ago by Bernd Fischer ▴ 550 • written 10.3 years ago by Guest User ★ 13k

score 0 · Answer 1 · 2014-01-17

Dear James! On 08.01.2014, at 19:19, James at embl-heidelberg.de wrote: > I have a database (22 GB) in SQLite that I query from R for numerical analysis. I'm considering converting the database to HDF5 for faster read times (because reading the population is slow). I have two questions about the rhdf5 package that I haven't been able to figure out from my own experimenting. > > (i) Suppose that I save an R dataframe to a HDF file. Is it possible to read subsets of the dataframe based on variable names and variable values? Often, I don't won't to read the full dataframe into memory (~ 100 million observations and ~ 30 variables). HDF5 does not support the R-specific structure 'data.frame'. In the standard settings, data.frame's are stored as an HDF5 compound datatype. Reading compound data is very inefficient. You should consider storing your data as arrays. This will provide you with the most efficient reading and writing methods. And you are able to subset these arrays. Subsetting is currently not supported for data.frames/Compound data types. Subsetting with variable names is not part of rhdf5/HDF5. You can only subset using Integers. You have to organize the mapping from names to indices yourself. > (ii) I frequently use indexes in my SQLite database to quickly join related tables. Does rhdf5 have a similar feature? If not, will converting to a HDF5 database create substantial bottlenecks if I rely on these joins frequently? HDF5 is not a database management system as is SQLite. If you need a 'join', you should consider SQLite or other, more efficient database management systems. > Thanks so much for your help. > > -- output of sessionInfo(): > > N/A > > -- > Sent via the guest posting facility at bioconductor.org.