Querying a HDF5 database from R
1
0
Entering edit mode
Guest User ★ 12k
@guest-user-4897
Last seen 7.2 years ago
Hello, I have a database (22 GB) in SQLite that I query from R for numerical analysis. I'm considering converting the database to HDF5 for faster read times (because reading the population is slow). I have two questions about the rhdf5 package that I haven't been able to figure out from my own experimenting. (i) Suppose that I save an R dataframe to a HDF file. Is it possible to read subsets of the dataframe based on variable names and variable values? Often, I don't won't to read the full dataframe into memory (~ 100 million observations and ~ 30 variables). (ii) I frequently use indexes in my SQLite database to quickly join related tables. Does rhdf5 have a similar feature? If not, will converting to a HDF5 database create substantial bottlenecks if I rely on these joins frequently? Thanks so much for your help. -- output of sessionInfo(): N/A -- Sent via the guest posting facility at bioconductor.org.
rhdf5 rhdf5 • 1.1k views
ADD COMMENT
0
Entering edit mode
Bernd Fischer ▴ 540
@bernd-fischer-5348
Last seen 4.9 years ago
Germany / Heidelberg / DKFZ
Dear James! On 08.01.2014, at 19:19, James at embl-heidelberg.de wrote: > I have a database (22 GB) in SQLite that I query from R for numerical analysis. I'm considering converting the database to HDF5 for faster read times (because reading the population is slow). I have two questions about the rhdf5 package that I haven't been able to figure out from my own experimenting. > > (i) Suppose that I save an R dataframe to a HDF file. Is it possible to read subsets of the dataframe based on variable names and variable values? Often, I don't won't to read the full dataframe into memory (~ 100 million observations and ~ 30 variables). HDF5 does not support the R-specific structure 'data.frame'. In the standard settings, data.frame's are stored as an HDF5 compound datatype. Reading compound data is very inefficient. You should consider storing your data as arrays. This will provide you with the most efficient reading and writing methods. And you are able to subset these arrays. Subsetting is currently not supported for data.frames/Compound data types. Subsetting with variable names is not part of rhdf5/HDF5. You can only subset using Integers. You have to organize the mapping from names to indices yourself. > (ii) I frequently use indexes in my SQLite database to quickly join related tables. Does rhdf5 have a similar feature? If not, will converting to a HDF5 database create substantial bottlenecks if I rely on these joins frequently? HDF5 is not a database management system as is SQLite. If you need a 'join', you should consider SQLite or other, more efficient database management systems. > Thanks so much for your help. > > -- output of sessionInfo(): > > N/A > > -- > Sent via the guest posting facility at bioconductor.org.
ADD COMMENT

Login before adding your answer.

Traffic: 309 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6