Hello,
I have a database (22 GB) in SQLite that I query from R for numerical
analysis. I'm considering converting the database to HDF5 for faster
read times (because reading the population is slow). I have two
questions about the rhdf5 package that I haven't been able to figure
out from my own experimenting.
(i) Suppose that I save an R dataframe to a HDF file. Is it possible
to read subsets of the dataframe based on variable names and variable
values? Often, I don't won't to read the full dataframe into memory (~
100 million observations and ~ 30 variables).
(ii) I frequently use indexes in my SQLite database to quickly join
related tables. Does rhdf5 have a similar feature? If not, will
converting to a HDF5 database create substantial bottlenecks if I rely
on these joins frequently?
Thanks so much for your help.
-- output of sessionInfo():
N/A
--
Sent via the guest posting facility at bioconductor.org.
Dear James!
On 08.01.2014, at 19:19, James at embl-heidelberg.de wrote:
> I have a database (22 GB) in SQLite that I query from R for
numerical analysis. I'm considering converting the database to HDF5
for faster read times (because reading the population is slow). I have
two questions about the rhdf5 package that I haven't been able to
figure out from my own experimenting.
>
> (i) Suppose that I save an R dataframe to a HDF file. Is it possible
to read subsets of the dataframe based on variable names and variable
values? Often, I don't won't to read the full dataframe into memory (~
100 million observations and ~ 30 variables).
HDF5 does not support the R-specific structure 'data.frame'. In the
standard settings,
data.frame's are stored as an HDF5 compound datatype. Reading compound
data is
very inefficient. You should consider storing your data as arrays.
This will provide you
with the most efficient reading and writing methods. And you are able
to subset these
arrays. Subsetting is currently not supported for data.frames/Compound
data types.
Subsetting with variable names is not part of rhdf5/HDF5. You can only
subset using
Integers. You have to organize the mapping from names to indices
yourself.
> (ii) I frequently use indexes in my SQLite database to quickly join
related tables. Does rhdf5 have a similar feature? If not, will
converting to a HDF5 database create substantial bottlenecks if I rely
on these joins frequently?
HDF5 is not a database management system as is SQLite. If you need a
'join', you
should consider SQLite or other, more efficient database management
systems.
> Thanks so much for your help.
>
> -- output of sessionInfo():
>
> N/A
>
> --
> Sent via the guest posting facility at bioconductor.org.