Zarr Dask Table
We would like a persistence format for Dask dataframes that can scale to multiple machines. Solutions like HDF5, BColz, and Castra are restricted to a single file system and have their own issues besides. Solutions like CSV can scale out but are inefficient. Solutions like Parquet are currently poorly supported in Python, and are complex even on a single machine.
Zarr is an interesting library that implements a sane subset of the HDF5 model (regularly chunked ndarrays, groups, metadata) on top of MutableMappings (memory, disk, S3/HDFS). I’m curious what a modern tabular format built on top of Zarr would look like and how it would perform. This could be a stop-gap until Parquet support arrives, or it could be a long-term competitor that also scales down nicely, or extends out to the nd-array and grouped cases.
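For concreteness, here is a minimal sketch of that model using the Zarr v2-style Python API: arrays, groups, and metadata all live as values in an ordinary MutableMapping, so swapping a plain dict for a directory or an S3 mapping changes nothing else.

```python
import numpy as np
import zarr

# Any MutableMapping works as storage; a plain dict keeps everything in
# memory, while a DirectoryStore or an s3fs mapping would persist it.
store = {}
root = zarr.group(store=store)
z = root.zeros('x', shape=(1000,), chunks=(100,), dtype='f8')
z[:] = np.random.random(1000)

# Groups and arrays carry JSON-serializable metadata
root.attrs['created-by'] = 'example'

# The store now holds chunk blobs plus small metadata documents
print(sorted(store)[:3])  # e.g. ['.zattrs', '.zgroup', 'x/.zarray']
```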
Some things Zarr does well
- Scales from memory, to single-file disk, to multi-file disk, to S3/HDFS, and beyond
- Tuned performance
- Compression
- Sane and simple design with a published spec, active and funded maintainer, and good test coverage
- Extensible with metadata and groups
Some things we would need to figure out
- How do we efficiently encode text data?
- How do we efficiently encode categorical data? (a sketch of one option follows this list)
- How do we encode partition information?
- How do we deal with the fact that partitions aren’t regularly sized, or that we may not even know their sizes ahead of time?
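For the categorical question, one plausible encoding (entirely hypothetical, not a settled design) is to store the integer codes as a regular Zarr array and keep the category labels in its JSON attrs, mirroring how pandas represents categoricals:

```python
import numpy as np
import pandas as pd
import zarr

root = zarr.group(store={})

# Hypothetical layout: int8 codes in the array, labels in the attrs
codes = np.array([0, 1, 0, 2, 1], dtype='i1')
z = root.array('color', codes)
z.attrs['categories'] = ['red', 'green', 'blue']

# Round-trip back to a pandas Categorical from codes + labels
cat = pd.Categorical.from_codes(z[:], categories=z.attrs['categories'])
```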
Motivation
I think that this could be useful for Dask, fulfilling a need that we have. I also think that it could be a fun experiment for Zarr, to see how it responds to a new use case with different constraints.
I don’t think that this replaces efforts towards Parquet Python support, which remains a dominant storage format with many of the above questions already answered well.
Thoughts on partitions
I see two options here:
- We use one partition per zarr-array, arranging them into a group
- We push on Zarr to support arrays with unknown chunk sizes, which would not support slicing but would support picking out particular chunks
If using Zarr with many singly-chunked arrays organized into groups is not particularly slow, then I say we stick with that. Otherwise I’d be curious what a chunks=None option would look like for Zarr, and whether the added complexity is worth it.
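A minimal sketch of the first option, with a hypothetical part-N naming scheme: each partition becomes its own singly-chunked array inside a group, so irregular partition sizes are unproblematic and reading partition i touches only that array.

```python
import numpy as np
import zarr

root = zarr.group(store={})
g = root.create_group('x')  # hypothetically, one group per column

# Irregularly sized partitions each become one array with one chunk
partitions = [np.arange(5.0), np.arange(3.0), np.arange(8.0)]
for i, part in enumerate(partitions):
    g.array('part-%d' % i, part, chunks=part.shape)
g.attrs['npartitions'] = len(partitions)

# Picking out a particular partition reads exactly one chunk
second = g['part-1'][:]
```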
Thoughts on Text
I’ve been using msgpack to encode lists of text lately, which seems to do a good job in terms of maturity and performance. I’m curious if there is any appetite within Zarr to expand the spec to include a special text dtype. This is a clear deviation from the “Zarr’s model is just NumPy’s model” principle, but text is an important case that NumPy doesn’t appear likely to handle well in the moderate future.
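As a point of reference, here is what that msgpack round-trip looks like on a list of variable-length strings; a blob like this could plausibly serve as one chunk’s worth of data.

```python
import msgpack

texts = ['alice', 'bob', 'a considerably longer piece of text']

# Pack the whole list into a single bytes blob
blob = msgpack.packb(texts, use_bin_type=True)

# ...and recover the original Python strings
assert msgpack.unpackb(blob, raw=False) == texts
```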
Top GitHub Comments
A Zarr Array or Group can be pickled, as long as the underlying store instance can be pickled. All storage classes provided by Zarr (DictStore, DirectoryStore, ZipStore) can be pickled, as of course can the built-in dict class, which can also be used for storage. This behaviour is covered in the Zarr test suite.
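A minimal sketch of that behaviour, assuming a DirectoryStore-backed array:

```python
import pickle
import zarr

# Create an array backed by a DirectoryStore on disk
z = zarr.open('example.zarr', mode='w', shape=(100,), chunks=(10,), dtype='i4')
z[:] = 42

# Pickling the array pickles the store with it; the copy reads the same data
z2 = pickle.loads(pickle.dumps(z))
assert (z2[:] == z[:]).all()
```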
Parquet is now decently well supported. Closing.