Dear Jason!
I have forwarded this question to the Bioconductor mailing list so that
other people can join the discussion and benefit from the answers.
Please always send such inquiries to
bioconductor@r-project.org.
Reading and writing chunked datasets is fastest if the leftmost
dimensions of the chunk have the same
extent as the dataset itself. For instance, in the example below the
dataset has dimensions
20000 x 10000 and the chunk size is 20000 x 10.
library(rhdf5)
# Create the file and a 20000 x 10000 dataset stored in chunks of 20000 x 10
h5createFile("test.h5")
h5createDataset(file="test.h5", dataset="A", dims=c(20000,10000),
                chunk=c(20000,10), level=3)
# Fill the dataset one chunk (10 complete columns) at a time
for (i in 1:1000) {
  print(i)
  S = matrix(rnorm(200000), nrow=20000, ncol=10)
  h5write(obj=S, file="test.h5", name="A",
          index=list(NULL, 1:10+(i-1)*10))
}
On my computer it takes about half a minute to fill the dataset with
random numbers. You can now
even use compression, e.g. by setting level=3. This increases the
runtime to fill the matrix to about
2 minutes, but can reduce the file size a lot.
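To see why the chunk shape matters for a given access pattern, here is a
rough sketch in base R that counts how many chunks a selection overlaps.
The helper chunks_touched() is hypothetical (it is not part of rhdf5), and
it assumes that every chunk a selection overlaps has to be read in full,
which is a simplification.

```r
# Hypothetical helper, not part of rhdf5: count the chunks that an index
# selection overlaps. 'dims' and 'chunk' are the dataset and chunk extents;
# 'index' is a list with one entry per dimension, where NULL means "all",
# as in the index argument of h5write.
chunks_touched <- function(dims, chunk, index) {
  per_dim <- vapply(seq_along(dims), function(k) {
    idx <- index[[k]]
    if (is.null(idx)) idx <- seq_len(dims[k])
    # Chunk number (0-based) of each selected position along dimension k
    length(unique((idx - 1) %/% chunk[k]))
  }, integer(1))
  prod(per_dim)
}

dims  <- c(20000, 10000)
chunk <- c(20000, 10)

# Ten complete columns, aligned with the chunk layout used above
cols <- chunks_touched(dims, chunk, list(NULL, 1:10))

# Ten complete rows across all columns, with the same chunk layout
rows <- chunks_touched(dims, chunk, list(1:10, NULL))

print(c(columns = cols, rows = rows))
```

With this layout, a block of ten complete columns falls into a single
chunk, while ten complete rows overlap every chunk in the dataset. The same
helper can be used to compare candidate chunk shapes against the access
pattern you actually need before writing the file.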
Best,
Bernd
> Hi Bernd,
> I am now using the rhdf5 package to store a large matrix in HDF5 format.
> The matrix is about 10000*10000 and contains floating-point data. We
> want to maximize the speed of reading data, but we do not care
> about the speed of writing the dataset. Although the data is stored
> as a matrix, only one or several complete rows will be read each time,
> i.e. all columns will be read for the selected rows.
> According to the manual on Bioconductor, the optimal chunk size would
> be 100*(number of columns). However, when we increased the chunk size
> from 100*100 to 100*1000, the read speed decreased significantly.
> We have not tried a 100*10000 chunk size yet because rhdf5
> could not finish writing the dataset even after several hours. All
> tests were done with no compression (level=0).
>
> Given our situation, could you please suggest an optimal
> chunk size so that the read speed reaches its maximum? Or are
> there any other ways to increase the performance? Thanks!
>
>
>
> Best
> Jason