Zarr Dask Table
We would like a persistence format for Dask dataframes that can scale to multiple machines. Solutions like HDF5, BColz, and Castra are restricted to a single file system and have their own issues besides. Solutions like CSV can scale out but are inefficient. Solutions like Parquet are currently poorly supported in Python, and are complex even on a single machine.
Zarr is an interesting library that implements a sane subset of the HDF5 model (regularly chunked ndarrays, groups, metadata) on top of MutableMappings (memory, disk, S3/HDFS). I’m curious what a modern tabular format built on top of Zarr would look like and how it would perform. This could be a stop-gap until Parquet support arrives, or it could be a long-term competitor that also scales down nicely, or extends out to the nd-array and grouped cases.
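For concreteness, here is a minimal sketch of that model using the Zarr v2-style Python API: arrays, groups, and metadata all live as values in an ordinary MutableMapping, so swapping a plain dict for a directory or an S3 mapping changes nothing else.

```python
import numpy as np
import zarr

# Any MutableMapping works as storage; a plain dict keeps everything in
# memory, while a DirectoryStore or an s3fs mapping would persist it.
store = {}
root = zarr.group(store=store)
z = root.zeros('x', shape=(1000,), chunks=(100,), dtype='f8')
z[:] = np.random.random(1000)

# Groups and arrays carry JSON-serializable metadata
root.attrs['created-by'] = 'example'

# The store now holds chunk blobs plus small metadata documents
print(sorted(store)[:3])  # e.g. ['.zattrs', '.zgroup', 'x/.zarray']
```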
Some things Zarr does well
- Scales from memory, to single-file disk, to multi-file disk, to S3/HDFS, and beyond
- Tuned performance
- Compression
- Sane and simple design with a published spec, active and funded maintainer, and good test coverage
- Extensible with metadata and groups
Some things we would need to figure out
- How do we efficiently encode text data?
- How do we efficiently encode categorical data? (a sketch of one option follows this list)
- How do we encode partition information?
- How do we deal with the fact that partitions aren’t regularly sized, or that we may not even know their sizes ahead of time?
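For the categorical question, one plausible encoding (entirely hypothetical, not a settled design) is to store the integer codes as a regular Zarr array and keep the category labels in its JSON attrs, mirroring how pandas represents categoricals:

```python
import numpy as np
import pandas as pd
import zarr

root = zarr.group(store={})

# Hypothetical layout: int8 codes in the array, labels in the attrs
codes = np.array([0, 1, 0, 2, 1], dtype='i1')
z = root.array('color', codes)
z.attrs['categories'] = ['red', 'green', 'blue']

# Round-trip back to a pandas Categorical from codes + labels
cat = pd.Categorical.from_codes(z[:], categories=z.attrs['categories'])
```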
Motivation
I think that this could be useful for Dask, fulfilling a need that we have. I also think that it could be a fun experiment for Zarr, to see how it responds to a new use case with different constraints.
I don’t think that this replaces efforts towards Parquet Python support, which remains a dominant storage format with many of the above questions already answered well.
Thoughts on partitions
I see two options here:
- We use one partition per zarr-array, arranging them into a group
- We push on Zarr to support arrays with unknown chunk sizes, which would not support slicing but would support picking out particular chunks
If using Zarr with many singly-chunked arrays organized into groups is not particularly slow, then I say we stick with that. Otherwise I’d be curious what a chunks=None option would look like for Zarr, and whether the added complexity is worth it.
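A minimal sketch of the first option, with a hypothetical part-N naming scheme: each partition becomes its own singly-chunked array inside a group, so irregular partition sizes are unproblematic and reading partition i touches only that array.

```python
import numpy as np
import zarr

root = zarr.group(store={})
g = root.create_group('x')  # hypothetically, one group per column

# Irregularly sized partitions each become one array with one chunk
partitions = [np.arange(5.0), np.arange(3.0), np.arange(8.0)]
for i, part in enumerate(partitions):
    g.array('part-%d' % i, part, chunks=part.shape)
g.attrs['npartitions'] = len(partitions)

# Picking out a particular partition reads exactly one chunk
second = g['part-1'][:]
```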
Thoughts on Text
I’ve been using msgpack to encode lists of text lately, which seems to do a good job in terms of maturity and performance. I’m curious if there is any appetite within Zarr to expand the spec to include a special text dtype. This is a clear deviation from the “Zarr’s model is just NumPy’s model” principle, but text is an important case that NumPy doesn’t appear likely to handle well in the moderate future.
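As a point of reference, here is what that msgpack round-trip looks like on a list of variable-length strings; a blob like this could plausibly serve as one chunk’s worth of data.

```python
import msgpack

texts = ['alice', 'bob', 'a considerably longer piece of text']

# Pack the whole list into a single bytes blob
blob = msgpack.packb(texts, use_bin_type=True)

# ...and recover the original Python strings
assert msgpack.unpackb(blob, raw=False) == texts
```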
Top GitHub Comments
A Zarr Array or Group can be pickled, as long as the underlying store instance can be pickled. All storage classes provided by Zarr (DictStore, DirectoryStore, ZipStore) can be pickled, as of course can the built-in dict class, which can also be used for storage. This behaviour is covered in the Zarr test suite.
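A minimal sketch of that behaviour, assuming a DirectoryStore-backed array:

```python
import pickle
import zarr

# Create an array backed by a DirectoryStore on disk
z = zarr.open('example.zarr', mode='w', shape=(100,), chunks=(10,), dtype='i4')
z[:] = 42

# Pickling the array pickles the store with it; the copy reads the same data
z2 = pickle.loads(pickle.dumps(z))
assert (z2[:] == z[:]).all()
```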
Parquet is now decently well supported. Closing.