Hi merv
Thanks for providing the example file. It's actually quite difficult to work with HDF5 files that have been compressed like this. By compressing the whole file externally you lose some of the key advantages of the HDF5 format, namely the ability to jump into a dataset and extract specific subsets. Typically HDF5 datasets are chunked and distributed across the file on disk, and the file contains a "map" of where those chunks are located, so a reader can fetch only the chunks needed for the data you ask for, significantly speeding up I/O for that type of operation. However, if the HDF5 file is wrapped in an external compression layer like a zip archive, that "map" is no longer correct. I'm not sure whether it would even be possible to make rhdf5 work with files in this format, but even if it were, I don't think it's a desirable way to work with HDF5.
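To illustrate the subsetting ability I mean: with a regular (not externally zipped) HDF5 file, rhdf5 can read just a slice of a dataset via the index argument to h5read, touching only the chunks that overlap the requested region. This is just a sketch; the file path and the dataset name /X/data below are placeholders you'd replace with something actually present in your file.

library(rhdf5)

## placeholder paths - substitute a real .h5ad file and a dataset it contains
h5_file <- "TS_Mammary.h5ad"
## read only the first 100 values of a 1-dimensional dataset;
## HDF5 consults its chunk index and reads only the chunks that are required
x_subset <- h5read(h5_file, name = "/X/data", index = list(1:100))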
Rather than compressing the whole file from the outside, HDF5 provides compression of the datasets within the file itself. You can use the same DEFLATE algorithm used in ZIP files, or many others, to compress individual datasets. In HDF5 parlance these are typically referred to as filters, and lots of programs apply them by default. I'm surprised that the h5ad file you have here doesn't have any applied, but that looks to be the case.
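If it helps to see the filter mechanism in isolation, here's a minimal sketch (the file names are made up) that writes the same highly compressible matrix twice, once without a filter and once through the GZIP filter, so you can compare the resulting file sizes. I'm assuming filter = "NONE" writes the dataset unfiltered; the filter = "GZIP" argument is the same one used for the large datasets in the code further down.

library(rhdf5)
m <- matrix(rpois(1e6, lambda = 0.1), ncol = 100)

## dataset stored without any compression filter
h5createFile("example_none.h5")
h5createDataset("example_none.h5", "m", dims = dim(m),
                storage.mode = "integer", chunk = c(1000, 100),
                filter = "NONE")
h5write(m, "example_none.h5", name = "m")

## the same data, but each chunk is passed through the GZIP (DEFLATE) filter
h5createFile("example_gzip.h5")
h5createDataset("example_gzip.h5", "m", dims = dim(m),
                storage.mode = "integer", chunk = c(1000, 100),
                filter = "GZIP", level = 6)
h5write(m, "example_gzip.h5", name = "m")

file.size("example_none.h5")
file.size("example_gzip.h5")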
Here's some code that will create a copy of your h5ad file, but with compression enabled for the largest datasets in the file.
library(rhdf5)
library(dplyr)
compressed_input_file <- "~/Downloads/TS_Mammary.h5ad.zip"
input_file <- utils::unzip(compressed_input_file, exdir = tempdir())
output_file <- "/tmp/TS_Mammary_compressed.h5ad"
## list the groups and datasets present in the original file
groups <- h5ls(input_file) |>
  filter(otype == "H5I_GROUP") |>
  mutate(path = paste(group, name, sep = "/")) |>
  pull(path)
datasets <- h5ls(input_file) |>
  filter(otype == "H5I_DATASET") |>
  mutate(path = paste(group, name, sep = "/"))

## only datasets with more than 1e6 entries will be rewritten with compression
large_datasets <- datasets |> filter(as.integer(dim) > 1e6)
small_datasets <- datasets |> filter(as.integer(dim) <= 1e6 | is.na(as.integer(dim)))

## create the new file and replicate the group structure
h5createFile(file = output_file)
for(i in seq_along(groups)) { h5createGroup(output_file, groups[i]) }

## recreate each large dataset with chunking and the GZIP filter, then copy its contents
for(i in seq_len(nrow(large_datasets))) {
  ds <- h5read(input_file, large_datasets$path[i])
  h5createDataset(output_file, large_datasets$path[i],
                  dims = large_datasets$dim[i],
                  storage.mode = storage.mode(ds),
                  chunk = 100000,
                  filter = "GZIP")
  h5write(ds, output_file, name = large_datasets$path[i])
}

## the small datasets are copied directly between the two files
fid1 <- H5Fopen(input_file)
fid2 <- H5Fopen(output_file)
for(i in seq_len(nrow(small_datasets))) {
  H5Ocopy(fid1, small_datasets$path[i], fid2, small_datasets$path[i])
}
h5closeAll(fid1, fid2)
We can do a few checks to make sure that the contents of the two files are the same:
identical(
  h5ls(input_file),
  h5ls(output_file)
)
identical(
  h5read(input_file, "/"),
  h5read(output_file, "/")
)
As a nice bonus, our new file is actually smaller than the original zip:
file.size(compressed_input_file)
file.size(output_file)