zarr slower than npy, hdf5 etc?

See original GitHub issue

I got interested in the performance of zarr and did a comparison with npy, pickle, hdf5, etc.; see https://stackoverflow.com/a/58942584/353337. To my surprise, I found that zarr reads large arrays more slowly than npy. This holds for random float data as well as for more structured mesh data. I had expected zarr to take the cake by using multiple cores. Then again, perhaps this isn’t a good test for zarr to show its strengths.

[Plot out.png: read time vs. len(data) for the npy, hdf5, pickle, pytables, and zarr kernels]

Code to reproduce the plot:

import perfplot
import pickle
import numpy
import h5py
import tables
import zarr


def setup(n):
    data = numpy.random.rand(n)
    # import meshzoo
    # n = int(numpy.cbrt(n))
    # points, cells = meshzoo.cube(
    #    xmin=0.0, xmax=1.0, ymin=0.0, ymax=1.0, zmin=0.0, zmax=1.0, nx=n, ny=n, nz=n
    # )
    # data = cells
    # write all files
    #
    numpy.save("out.npy", data)
    #
    f = h5py.File("out.h5", "w")
    f.create_dataset("data", data=data)
    f.close()
    #
    with open("test.pkl", "wb") as f:
        pickle.dump(data, f)
    #
    f = tables.open_file("pytables.h5", mode="w")
    gcolumns = f.create_group(f.root, "columns", "data")
    f.create_array(gcolumns, "data", data, "data")
    f.close()
    #
    zarr.save("out.zip", data)
    zarr.save("out.zarr", data)


def npy_read(data):
    return numpy.load("out.npy")


def hdf5_read(data):
    f = h5py.File("out.h5", "r")
    out = f["data"][()]
    f.close()
    return out


def pickle_read(data):
    with open("test.pkl", "rb") as f:
        out = pickle.load(f)
    return out


def pytables_read(data):
    f = tables.open_file("pytables.h5", mode="r")
    out = f.root.columns.data[()]
    f.close()
    return out


def zarr_zarr_read(data):
    return zarr.load("out.zarr")


def zarr_zip_read(data):
    return zarr.load("out.zip")


b = perfplot.bench(
    setup=setup,
    kernels=[
        npy_read,
        hdf5_read,
        pickle_read,
        pytables_read,
        zarr_zarr_read,
        zarr_zip_read,
    ],
    n_range=[2 ** k for k in range(27)],
    xlabel="len(data)",
    title=f"zarr {zarr.__version__}",
)
b.save("out.png")
b.show()

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 24 (10 by maintainers)

Top GitHub Comments

2 reactions
constantinpape commented, Nov 22, 2019

I don’t know; I just used the default values.

AFAIK, zarr uses Blosc compression by default. h5py does not compress by default, and it does not chunk the data unless you pass chunks=True (or enable compression). NumPy and pickle neither compress nor chunk; I don’t know about pytables. So the comparison is not very fair.
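
To make the comparison more apples-to-apples, one could either turn compression off in zarr or turn chunking and compression on in h5py. A minimal sketch, assuming the zarr 2.x API (where save_array forwards keyword arguments such as compressor to create()); file names are illustrative:

import numpy
import zarr
import h5py

data = numpy.random.rand(2 ** 24)

# zarr without compression -- closer to what h5py does by default
zarr.save_array("out_nocomp.zarr", data, compressor=None)

# h5py with chunking and compression enabled -- closer to zarr's default behaviour
with h5py.File("out_chunked.h5", "w") as f:
    f.create_dataset("data", data=data, chunks=True, compression="gzip")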

FWIW, when I benchmarked z5, which implements the zarr spec in C++, I found its performance on par with hdf5 for single-threaded reads and better when multi-threaded. Unfortunately I don’t have the results at hand right now; the code is here.
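
For reference, here is one way to exercise multiple threads from plain Python on a chunked zarr array, reading it chunk by chunk into a preallocated NumPy array. This is only a sketch, assuming "out.zarr" was written as in the benchmark above and is one-dimensional; Blosc generally releases the GIL during decompression, so the threads can actually overlap:

import numpy
import zarr
from concurrent.futures import ThreadPoolExecutor

z = zarr.open("out.zarr", mode="r")        # chunked (and, by default, compressed) array
out = numpy.empty(z.shape, dtype=z.dtype)  # preallocated destination


def read_chunk(start):
    stop = min(start + z.chunks[0], z.shape[0])
    out[start:stop] = z[start:stop]        # each chunk-aligned slice decompresses one chunk


with ThreadPoolExecutor() as pool:
    list(pool.map(read_chunk, range(0, z.shape[0], z.chunks[0])))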

1 reaction
joshmoore commented, Nov 15, 2022

I noticed there is an open issue on zarr+dask as well, so it made me unsure of the maturity of the duo: #962

I’ll update the description of this issue. Here the problem is that someone tried to wrap a dask in a zarr, but you should put a zarr in your dask. 😄
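
For readers who land here: the "zarr in your dask" direction just means building the dask array on top of the zarr store, so that dask schedules the parallel chunk reads. A minimal sketch, assuming dask is installed and "out.zarr" exists as above:

import dask.array as da

arr = da.from_zarr("out.zarr")   # lazy dask array backed by the zarr chunks
result = arr.mean().compute()    # chunks are read (and decompressed) in parallel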

Read more comments on GitHub >
