Using HDF5Array writeHDF5Array for a large dgCMatrix
rubi ▴ 110

Hi,

I have a large sparse matrix (40,000 by 40,000, called mat) as a dgCMatrix which I would like to save to an HDF5 file.

Converting the dgCMatrix to an HDF5Array with arr <- as(Matrix::as.matrix(mat), "HDF5Array") is very slow, and then writing it to the HDF5 file with writeHDF5Array(x=arr, file=my_file) is slower still.

Am I missing something, or is this the best performance HDF5Array can offer for saving a dgCMatrix to an HDF5 file? By contrast, saving the dgCMatrix to an RDS file is very fast.

 

Aaron Lun ★ 28k

There are several considerations here:

  1. You should be able to use setHDF5DumpFile() to control where the initial conversion in as() writes its data. As it stands, you are hitting the disk three times: a write in the initial as(), a read from arr during writeHDF5Array(), and a write to my_file. If you call setHDF5DumpFile() with my_file before the as(), the conversion writes straight to the target file and the separate writeHDF5Array() step is no longer needed, which should cut the required time about 3-fold.
  2. I would increase the block size with options(DelayedArray.block.size=2e8), which controls how many values are processed at a time. The default block size is pretty small, IIRC (maybe 75000?), and increasing it should improve speed at the cost of higher memory usage.
  3. Saving the dgCMatrix as an RDS file is not comparable to generating an HDF5Array. There's no way to access subsets of the RDS file without loading the entire thing into memory; efficient extraction of subsets of the matrix is the HDF5Array's raison d'être. This is also part of the reason why saving an HDF5Array is slower: it has to store both zero and non-zero values, while the RDS file stores only the latter.
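
To make the three points concrete, here is a minimal sketch on a small stand-in matrix. The temporary path used for my_file, the dataset name "counts", and the 200 by 200 matrix are all assumptions made for illustration; swap in your own objects. Depending on your DelayedArray version, the block size may be controlled by a different mechanism than the option shown, so treat that line as a placeholder for whatever your version documents.

```r
library(Matrix)
library(HDF5Array)

# Small stand-in for the real 40,000 x 40,000 dgCMatrix (assumption).
set.seed(1)
mat <- rsparsematrix(200, 200, density = 0.01)

# Point 1: send the "dump" produced by as() straight to the target file.
my_file <- tempfile(fileext = ".h5")  # hypothetical output path
setHDF5DumpFile(my_file)
setHDF5DumpName("counts")             # hypothetical dataset name

# Point 2: request larger blocks. Newer DelayedArray releases may ignore
# this option in favour of setAutoBlockSize(); check your version.
options(DelayedArray.block.size = 2e8)

# A single write to disk; no separate writeHDF5Array() call afterwards.
arr <- as(Matrix::as.matrix(mat), "HDF5Array")

# Point 3: a subset can be pulled from disk without loading the rest,
# which is not possible with an RDS file.
corner <- as.matrix(arr[1:5, 1:3])
```

This still goes through as.matrix(), so the dense intermediate is built in memory once; only the redundant disk traffic is removed.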

Indeed, part of the slow-down in point 3 is probably due to the as.matrix() call inside the as(), which creates a fairly large dense matrix (40,000 by 40,000 doubles is about 12.8 GB). I would have expected as() to work without forming an intermediate dense matrix. In any case, I have done something similar in the DropletUtils package (see the read10xMatrix() function), though applying it here would require you to first call Matrix::writeMM() on your dgCMatrix to save it to a file in MatrixMarket format.

I would guess that the HDF5 chunk dimensions are not a consideration here - IIRC, the HDF5Array package tries to use column-based chunks, which should be pretty efficient with any use of as() (i.e., avoiding repeated writes to the same chunk).
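
If the dense intermediate is the bottleneck, one way around it is to write the matrix in column blocks so that only one block is ever dense in memory. The sketch below is not how HDF5Array or DropletUtils do it internally; it is an illustration using the lower-level rhdf5 package, with a hypothetical output path, dataset name ("counts"), block width, and stand-in matrix. Each chunk holds exactly one column, so no chunk is written more than once.

```r
library(Matrix)
library(rhdf5)

# Small stand-in for the real dgCMatrix (assumption).
set.seed(1)
mat <- rsparsematrix(1000, 300, density = 0.01)

out_file <- tempfile(fileext = ".h5")  # hypothetical output path
h5createFile(out_file)

# Column-based chunks: one chunk per column.
h5createDataset(out_file, "counts", dims = dim(mat),
                storage.mode = "double",
                chunk = c(nrow(mat), 1L),
                level = 6)

block_width <- 50L  # columns densified per iteration (tunable assumption)
starts <- seq(1L, ncol(mat), by = block_width)
for (s in starts) {
  cols <- s:min(s + block_width - 1L, ncol(mat))
  # Only this block of columns is converted to a dense matrix.
  block <- as.matrix(mat[, cols, drop = FALSE])
  # NULL in the index list selects all rows.
  h5write(block, out_file, "counts", index = list(NULL, cols))
}
h5closeAll()
```

The resulting dataset should then be usable as an HDF5-backed array by pointing HDF5Array() at out_file and "counts".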
