Support of parquet files for on-disk Bioconductor objects
sandmann.t ▴ 70
@sandmannt-11014
Last seen 7 months ago
United States

I am curious about the use of parquet files as an on-disk storage back-end for Bioconductor objects. So far, I have found some references to parquet files in individual Bioconductor packages, but there doesn't seem to be broader support yet, e.g. via the (awesome!) DelayedArray package.

I am aware of existing support for true matrix-storage formats, e.g.

  • HDF5 files in HDF5Array (thank you, Herve!), but I am looking for even better support for cloud storage systems, or
  • TileDB, supported via TileDBArray (thank you, Aaron!), but - unlike parquet files - TileDB has not been adopted in my work environment yet.

Before I continue experimenting with marrying parquet and Bioconductor further, I was wondering if "parquet-backed Bioconductor objects" are a bad idea to begin with (and if so - why!), or if there are ongoing efforts already that I might benefit from (or contribute to).

Many thanks for any thoughts and pointers,

Thomas


Am I correct in thinking that parquet files are very column oriented? If so, they're a great analogue of a data.frame, but not so good a match for matrices/arrays, where you might want to extract features along any dimension. I guess I'm worried that things might appear array-like, but performance will be very different in different dimensions.


In principle, a Parquet file would be no different from 10X's HDF5 format for sparse matrices. Each matrix column would constitute a Parquet row group, containing the usual i/j/x sparse triplet (maybe the j column can be omitted as it is redundant with the row group ID for the matrix column). If i is used as the sort column within the row group, then you've got a CSC layout inside the Parquet file. At that point, the performance can be expected to be similar to the HDF5 format, i.e., great for column access, pretty bad for row access. Given that we already have a TENxMatrix, I don't see why we couldn't have a ParquetMatrix in the same manner.
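To make the layout concrete, here is a toy sketch of the triplet bookkeeping (just the CSC-style ordering; how to force exactly one row group per column is a separate question):

library(Matrix)

# A small 5 x 3 sparse matrix.
m <- Matrix::sparseMatrix(i = c(1, 4, 2, 5, 3),
                          j = c(1, 1, 2, 2, 3),
                          x = c(10, 20, 30, 40, 50))

# Column-major triplets, i.e. sorted by j and then by i - the CSC layout
# described above. Within a single column (one prospective row group), j is
# constant, so it could be dropped in favour of the row-group ID.
Matrix::summary(m)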


Aaron Lun: Just to make sure I understand, you mentioned that "each matrix column would constitute a Parquet row group".

For a typical RNA-seq experiment, the number of rows (= genes) is in the tens of thousands. Isn't that a little small for a parquet row group for efficient access? I think arrow::write_parquet() defaults to the total number of rows if the data has fewer than 250 million cells (rows x cols). (The _total number of rows_ refers to the number of i/j/x sparse triplets, I think.)

For the tenx_pbmc4k dataset with 19773 detected genes in 4340 cells, I end up with 6 row groups in the parquet file (see example below).

In the i/j/x representation, there will also be different numbers of rows (i) for each column (j), e.g. in a single-cell experiment different numbers of genes will be detectable in each cell. I am not sure how to choose a single row group size in that case.

Perhaps you can help me understand what you meant, and whether I should try to optimize this choice?

library(arrow)
library(Matrix)
library(TENxPBMCData)

tenx_pbmc4k <- suppressMessages(TENxPBMCData(dataset = "pbmc4k"))

df <- as.data.frame(
  Matrix::summary(
    as(counts(tenx_pbmc4k), "dgCMatrix")
  )
)

df <- data.frame(
  i = factor(row.names(tenx_pbmc4k)[df$i], levels = row.names(tenx_pbmc4k)),
  j = factor(tenx_pbmc4k$Barcode[df$j], levels = tenx_pbmc4k$Barcode),
  x = df$x
)
# range of the number of detected genes per cell
range(table(df$j)) #  498 5251

parquet_file <- tempfile(fileext = ".parquet")

# chunk_size argument: scalar integer, how many rows will be in each row group
arrow::write_parquet(x = df, sink = parquet_file, use_dictionary = TRUE,
                     chunk_size = NULL, version = "2.6")
pq <- arrow::ParquetFileReader$create(parquet_file)
pq$num_row_groups  # 6
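
If I read the arrow docs correctly, a single, uniform row group size can still be chosen explicitly via chunk_size (only variable, per-column sizes seem to be out of reach). A quick sketch reusing the df object from above - the exact default chunking may differ between arrow versions:

# Hypothetical follow-up: ask for roughly one-million-row row groups.
parquet_file2 <- tempfile(fileext = ".parquet")
arrow::write_parquet(x = df, sink = parquet_file2, use_dictionary = TRUE,
                     chunk_size = 1e6, version = "2.6")
pq2 <- arrow::ParquetFileReader$create(parquet_file2)
pq2$num_row_groups  # roughly ceiling(nrow(df) / 1e6)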

Oops. I was thinking that the row group sizes were variable and we could have fine-grained control over their sizes. Apparently not.

Well, no matter; it can still be made to work. Just store the usual triplets as you did in df, sorted by j and then i. Then you can easily strip out a column's worth of matrix data by querying the file on j.

Row access will probably suck, though no more than HDF5, given that it would be a full scan of the dataset in both cases.

Total number of rows will be the number of non-zero elements, which should be moderately sized (>200 million) for a medium-sized single-cell dataset, e.g., ~100k cells.
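
Concretely, the column query could look something like this - a minimal sketch assuming the df and parquet_file objects from your example above (how well arrow compares the dictionary-encoded columns against plain strings, and how much row-group skipping you get, depends on the arrow version):

library(arrow)
library(dplyr)

ds <- arrow::open_dataset(parquet_file, format = "parquet")

# Column (cell) access: filter on j and pull only that cell's triplets into R.
# Because the triplets are sorted by j, the matching rows are contiguous.
one_cell <- levels(df$j)[1]
col_triplets <- ds |>
  dplyr::filter(j == one_cell) |>
  dplyr::collect()

# Row (gene) access uses the same idiom, but every row group must be scanned.
one_gene <- levels(df$i)[1]
row_triplets <- ds |>
  dplyr::filter(i == one_gene) |>
  dplyr::collect()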


Thanks a lot for pointing that out! I agree and am curious about performance as well. There are a few advantages of the parquet format that might make it attractive even if access speed cannot match that of true array-based back-ends, e.g. the ability to add new data to an existing dataset simply by saving parquet files with the same schema in the same directory.
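
For example, a minimal sketch with two made-up toy batches (the real data would of course be the i/j/x triplets from above):

library(arrow)
library(dplyr)

# Two toy batches of triplets sharing the same schema.
batch1 <- data.frame(i = c(1L, 5L), j = "cell_A", x = c(2, 1))
batch2 <- data.frame(i = c(3L, 5L), j = "cell_B", x = c(4, 7))

dataset_dir <- tempfile("triplets_")
dir.create(dataset_dir)

arrow::write_parquet(batch1, file.path(dataset_dir, "batch1.parquet"))
# "Appending" later just means dropping another file with the same schema in:
arrow::write_parquet(batch2, file.path(dataset_dir, "batch2.parquet"))

# All files in the directory are exposed as a single queryable dataset.
ds <- arrow::open_dataset(dataset_dir, format = "parquet")
nrow(dplyr::collect(ds))  # 4 - rows from both batches combined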


Great point - I haven't dived deeply enough into row groups to understand how to use them most effectively. (Added to my to-do list now.) Thanks a lot for sharing your thoughts; I will explore further and share my progress. Any and all feedback will be much appreciated as I learn more.

Aaron Lun ★ 28k
@alun
Last seen 2 hours ago
The city by the bay

While it doesn't really solve your matrix problem, I was intrigued enough by the premise to start work on https://github.com/LTLA/ParquetDataFrame.

library(ParquetDataFrame)  # from https://github.com/LTLA/ParquetDataFrame
library(S4Vectors)         # provides DataFrame()

# Mocking up a file:
tf <- tempfile()
on.exit(unlink(tf))
arrow::write_parquet(mtcars, tf)

# Creating a vector on-disk:
ParquetColumnVector(tf, column="gear")
## <32> DelayedArray object of type "double":
##  [1]  [2]  [3]    . [31] [32] 
##    4    4    4    .    5    4 

# This happily lives inside DataFrames:
collected <- list()
for (x in colnames(mtcars)) {
    collected[[x]] <- ParquetColumnVector(tf, column=x)
}
DataFrame(collected)

So we can now construct DataFrames with a mix of normal, Parquet-derived and other columns. The show method for the DataFrame could possibly be more efficient if I could extract data from multiple columns at once; I'm not sure whether that would be worth creating a separate ParquetDataFrame class.

Anyway, contributions welcome.


Aaron Lun: Wow, that's awesome. Thanks a lot for sharing your ParquetDataFrame code. As always, plenty for me to learn from.
