Hi merv
Thanks for providing the example file. It's actually a bit difficult to work with HDF5 files that have been compressed like this. By applying this external compression you actually lose some of the advantages of the HDF5 format, namely the ability to jump into datasets and extract specific subsets. Typically HDF5 datasets are chunked and distributed on disk. The file then contains a "map" of where these chunks are located, and a reader can fetch only the chunks required for the data you need, significantly speeding up I/O for that type of operation. However, if the HDF5 file is compressed in an external wrapper like a zip file, that "map" can no longer be used directly. I'm not sure whether it would even be possible to make rhdf5 work with files in this format, but even if it is, I don't think it's a desirable way to work with HDF5.
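Just to illustrate what that partial access looks like in practice, here's a minimal sketch using rhdf5's index argument (the file and dataset path are only placeholders, not taken from your data):
library(rhdf5)
## read only the first 100 values of a (hypothetical) 1-D dataset;
## only the chunks covering those elements need to be read from disk
subset_vals <- h5read("some_file.h5ad", "/X/data", index = list(1:100))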
To get around this, HDF5 provides compression on the datasets within the file itself. You can use the same DEFLATE algorithm found in ZIP files, or one of many other algorithms, to compress individual datasets. In HDF5 parlance these are typically referred to as filters, and lots of programs will apply them by default. I'm surprised that the h5ad file you have here doesn't have this applied, but that looks to be the case.
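For reference, here's a tiny self-contained example of what enabling the GZIP filter looks like when creating a dataset from scratch (the file and dataset names are made up for the example):
library(rhdf5)
tmp_h5 <- tempfile(fileext = ".h5")
h5createFile(tmp_h5)
## filters only apply to chunked datasets, so we give a chunk size as well
h5createDataset(tmp_h5, "example_data",
                dims = 1e6,
                storage.mode = "double",
                chunk = 100000,
                filter = "GZIP")
h5write(rnorm(1e6), tmp_h5, name = "example_data")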
Here's some code that will create a copy of your h5ad file, but with compression enabled for the largest datasets in the file.
library(rhdf5)
library(dplyr)
## paths to our original zip file, the temporary extracted version, and the output
compressed_input_file <- "~/Downloads/TS_Mammary.h5ad.zip"
input_file <- utils::unzip(compressed_input_file, exdir = tempdir())
output_file <- "/tmp/TS_Mammary_compressed.h5ad"
## to construct the new file we get the full paths to all groups and datasets in our original file
groups <- h5ls(input_file) |>
  filter(otype == "H5I_GROUP") |>
  mutate(path = paste(group, name, sep = "/")) |>
  pull(path)
datasets <- h5ls(input_file) |>
  filter(otype == "H5I_DATASET") |>
  mutate(path = paste(group, name, sep = "/"))
## We're only going to compress the large datasets.
## In this case that's anything with more than 1 million elements.
## We get some warnings because the h5ls dimension output is a string, and for multi-dimensional datasets it contains non-numeric characters
large_datasets <- datasets |> filter(as.integer(dim) > 1e6)
#> Warning in mask$eval_all_filter(dots, env_filter): NAs introduced by coercion
small_datasets <- datasets |> filter(as.integer(dim) <= 1e6 | is.na(as.integer(dim)))
#> Warning in mask$eval_all_filter(dots, env_filter): NAs introduced by coercion
## first we create an empty file
h5createFile(file = output_file)
## now populate the group structure
for(i in seq_along(groups)) { h5createGroup(output_file, groups[i]) }
## For the "large" datasets we will create new datasets with compression turned on
## Then write the original data to the new file
for(i in seq_len(nrow(large_datasets))) {
  ## read the dataset from the original file
  ds <- h5read(input_file, large_datasets$path[i])
  ## the "chunk" and "filter" arguments are what enable the compression
  h5createDataset(output_file, large_datasets$path[i],
                  dims = large_datasets$dim[i],
                  storage.mode = storage.mode(ds),
                  chunk = 100000,
                  filter = "GZIP")
  ## write the data to our new file and dataset
  h5write(ds, output_file, name = large_datasets$path[i])
}
## For the "small" datasets we will copy them directly from on HDF5 file to the other
## This saves having to determine the correct datatype, dimensions, etc for each dataset.
## It's also faster than reading and writing into R
fid1 <- H5Fopen(input_file)
fid2 <- H5Fopen(output_file)
for(i in seq_len(nrow(small_datasets))) {
  H5Ocopy(fid1, small_datasets$path[i], fid2, small_datasets$path[i])
}
h5closeAll(fid1, fid2)
We can do a few checks to make sure that the contents of the two files are the same:
## test the output of h5ls
identical(
  h5ls(input_file),
  h5ls(output_file)
)
#> [1] TRUE
## compare the contents of both files
identical(
  h5read(input_file, "/"),
  h5read(output_file, "/")
)
#> [1] TRUE
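Although it isn't shown in the output above, partial reads also work directly on the new compressed file, for example pulling the first 100 values of one of the large 1-D datasets we rewrote:
## read a subset of one of the compressed datasets;
## only the chunks covering those elements are decompressed
h5read(output_file, large_datasets$path[1], index = list(1:100))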
As a nice bonus, our new file is actually smaller than the original zip:
## our file size is now smaller than the original zip file
file.size(compressed_input_file)
#> [1] 380791465
file.size(output_file)
#> [1] 323117996
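If you're curious, you could also compare these against the size of the uncompressed h5ad we extracted from the zip (I haven't included that number here):
## size of the uncompressed h5ad extracted earlier, for comparison
file.size(input_file)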