
We would like a persistence format for Dask dataframes that can scale to multiple machines. Solutions like HDF5, BColz, and Castra are single-file-system only and have their own issues besides. Solutions like CSV can scale out but are inefficient. Solutions like Parquet are currently poorly supported in Python and are complex even on a single machine.

Zarr is an interesting library that implements a sane subset of the HDF5 model (regularly chunked ndarrays, groups, metadata) on MutableMappings (memory, disk, s3/hdfs). I’m curious what a modern tabular format would look like built on top of Zarr and what its performance would be. This could be a stop-gap until Parquet support matures, or it could be a long-term competitor that also scales down nicely, or extends out to the nd-array and grouped cases.
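As a concrete illustration of that model, here is a minimal sketch assuming the Zarr v2 API (the store, names, and data are invented for the example):

```python
import numpy as np
import zarr

# Any MutableMapping can back a Zarr hierarchy; a plain dict keeps it in memory.
store = {}
root = zarr.group(store=store)

# A regularly chunked ndarray inside a group, plus user metadata.
z = root.zeros("x", shape=(10000,), chunks=(1000,), dtype="f8")
z[:] = np.arange(10000)
root.attrs["description"] = "toy column"

# Swapping the store for zarr.DirectoryStore("data.zarr") or an s3fs S3Map
# moves the same hierarchy to disk or S3 without changing the calling code.
```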

Some things Zarr does well

  • Scales from memory, to single-file disk, to multi-file disk, to S3/HDFS, and beyond
  • Tuned performance
  • Compression
  • Sane and simple design with a published spec, active and funded maintainer, and good test coverage
  • Extensible with metadata and groups

Some things we would need to figure out

  • How do we efficiently encode text data?
  • How do we efficiently encode categorical data? (one possibility is sketched after this list)
  • How do we encode partition information?
  • How do we deal with the fact that partitions aren’t regularly sized, or that we don’t even have known sizes ahead of time?
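For the categorical question, one plausible direction (purely a sketch; nothing here is settled, and the layout is invented for illustration) is to store the integer codes as a Zarr array and keep the labels in that array’s attributes:

```python
import pandas as pd
import zarr

# Encode a pandas categorical column as integer codes plus a label list.
s = pd.Series(["a", "b", "a", "c"], dtype="category")

g = zarr.group(store={})
z = g.array("color", s.cat.codes.values, chunks=(2,))
z.attrs["categories"] = list(s.cat.categories)  # attrs are JSON-serializable

# Round-trip: rebuild the categorical from codes + labels.
restored = pd.Categorical.from_codes(z[:], categories=z.attrs["categories"])
```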

Motivation

I think that this could be useful for Dask, fulfilling a need that we have. I also think that it could be a fun experiment for Zarr, to see how it responds to a new use case with different constraints.

I don’t think that this replaces efforts towards Parquet Python support, which remains a dominant storage format with many of the above questions already answered well.

Thoughts on partitions

I see two options here:

  1. We use one partition per zarr-array, arranging them into a group
  2. We push on Zarr to support unknown chunk-sized arrays that don’t support slicing, but do support picking out particular chunks

If using Zarr with many single-chunk arrays organized into groups is not particularly slow, then I say we stick with that. Otherwise I’d be curious what providing a chunks=None option would look like for Zarr, and whether that added complexity is worth it. Option 1 is sketched below.
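A minimal sketch of option 1, with invented names like part0 (again assuming the Zarr v2 API):

```python
import numpy as np
import zarr

# One Zarr array per partition, collected under a group per column.
# Partition lengths need not match, since each array carries its own shape.
g = zarr.group(store={})
col = g.create_group("x")
for i, part in enumerate([np.arange(5), np.arange(7), np.arange(3)]):
    col.array(f"part{i}", part, chunks=part.shape)

# Reading one partition touches only that array's chunks.
p1 = col["part1"][:]
```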

Thoughts on Text

I’ve been using msgpack to encode lists of text lately, and it seems to do a good job in terms of maturity and performance. I’m curious whether there is any appetite within Zarr to expand the spec to include a special text dtype. This would be a clear deviation from the “Zarr’s model is just NumPy’s model” principle, but text is an important case that NumPy doesn’t appear likely to handle well in the moderate future.
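To make that concrete, this is roughly what the msgpack encoding of a block of text values looks like (a sketch only; how the resulting bytes blobs would map onto Zarr chunks is exactly the open question):

```python
import msgpack

# A block of text values becomes a single bytes blob, one blob per chunk.
texts = ["alpha", "beta", "gamma"]
blob = msgpack.packb(texts)

# Decoding recovers the list; raw=False yields str rather than bytes.
assert msgpack.unpackb(blob, raw=False) == texts
```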

cc @alimanfoo @jcrist @martindurant @hussainsultan @shoyer

Issue Analytics

  • State: closed
  • Created: 2016
  • Reactions: 1
  • Comments: 28 (28 by maintainers)

Top GitHub Comments

1 reaction
alimanfoo commented, Nov 29, 2016

A Zarr Array or Group can be pickled, as long as the underlying store instance can be pickled. All storage classes provided by Zarr (DictStore, DirectoryStore, ZipStore) can be pickled, as of course can the built-in dict class which can also be used for storage. This behaviour is covered in the Zarr test suite.

On Monday, November 28, 2016, jakirkham notifications@github.com wrote:

Would Zarr also be supported via custom serialization?

Yes, or could make zarr arrays (or a custom wrapper containing them) pickleable.

Good point. Actually it seems they already pickle ok.

```python
>>> import pickle
>>> import numpy as np
>>> import zarr
>>> a = zarr.open_array("test.zarr", mode="w", shape=(10000, 10000),
...                     chunks=(1000, 1000), dtype=np.float32)
>>> pickle.loads(pickle.dumps(a))
Array((10000, 10000), float32, chunks=(1000, 1000), order=C)
  nbytes: 381.5M; nbytes_stored: 343.3M; ratio: 1.1; initialized: 100/100
  compressor: Blosc(cname='lz4', clevel=5, shuffle=1)
  store: DirectoryStore
```

As a better test, I tried writing out a pickle file in one process and loading it in another. No problems.

Is this something that is guaranteed/supported by Zarr or does it just happen to work?


0 reactions
mrocklin commented, Oct 23, 2018

Parquet is now decently well supported. Closing
