zarr slower than npy, hdf5 etc?

See original GitHub issue

I got interested in the performance of zarr and did a comparison with npy, pickle, hdf5, etc.; see https://stackoverflow.com/a/58942584/353337. To my surprise, I found that zarr reads large arrays more slowly than npy. This holds for random float data as well as for more structured mesh data. I had expected zarr to take the cake by using multiple cores. Then again, perhaps this isn’t a good test for zarr to show its strengths.

[Plot out.png: read time vs. len(data) for the npy, hdf5, pickle, pytables, and zarr kernels]

Code to reproduce the plot:

import perfplot
import pickle
import numpy
import h5py
import tables
import zarr


def setup(n):
    data = numpy.random.rand(n)
    # import meshzoo
    # n = int(numpy.cbrt(n))
    # points, cells = meshzoo.cube(
    #    xmin=0.0, xmax=1.0, ymin=0.0, ymax=1.0, zmin=0.0, zmax=1.0, nx=n, ny=n, nz=n
    # )
    # data = cells
    # write all files
    #
    numpy.save("out.npy", data)
    #
    f = h5py.File("out.h5", "w")
    f.create_dataset("data", data=data)
    f.close()
    #
    with open("test.pkl", "wb") as f:
        pickle.dump(data, f)
    #
    f = tables.open_file("pytables.h5", mode="w")
    gcolumns = f.create_group(f.root, "columns", "data")
    f.create_array(gcolumns, "data", data, "data")
    f.close()
    #
    zarr.save("out.zip", data)
    zarr.save("out.zarr", data)


def npy_read(data):
    return numpy.load("out.npy")


def hdf5_read(data):
    f = h5py.File("out.h5", "r")
    out = f["data"][()]
    f.close()
    return out


def pickle_read(data):
    with open("test.pkl", "rb") as f:
        out = pickle.load(f)
    return out


def pytables_read(data):
    f = tables.open_file("pytables.h5", mode="r")
    out = f.root.columns.data[()]
    f.close()
    return out


def zarr_zarr_read(data):
    return zarr.load("out.zarr")


def zarr_zip_read(data):
    return zarr.load("out.zip")


b = perfplot.bench(
    setup=setup,
    kernels=[
        npy_read,
        hdf5_read,
        pickle_read,
        pytables_read,
        zarr_zarr_read,
        zarr_zip_read,
    ],
    n_range=[2 ** k for k in range(27)],
    xlabel="len(data)",
    title=f"zarr {zarr.__version__}",
)
b.save("out.png")
b.show()

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 24 (10 by maintainers)

Top GitHub Comments

2 reactions
constantinpape commented, Nov 22, 2019

I don’t know; I just used the default values.

AFAIK, zarr uses Blosc compression by default. h5py does not compress by default, and it does not chunk the data unless you pass chunks=True (or enable compression). NumPy and pickle neither compress nor chunk; I don’t know about pytables. So the comparison is not very fair.
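
To make the comparison more apples-to-apples, one could either turn compression off in zarr or turn chunking and compression on in h5py. A minimal sketch, assuming the zarr 2.x API (where save_array forwards keyword arguments such as compressor to create()); file names are illustrative:

import numpy
import zarr
import h5py

data = numpy.random.rand(2 ** 24)

# zarr without compression -- closer to what h5py does by default
zarr.save_array("out_nocomp.zarr", data, compressor=None)

# h5py with chunking and compression enabled -- closer to zarr's default behaviour
with h5py.File("out_chunked.h5", "w") as f:
    f.create_dataset("data", data=data, chunks=True, compression="gzip")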

FWIW, when I benchmarked z5, which implements the zarr spec in C++, I found its performance on par with hdf5 for single-threaded reads and better when multi-threaded. Unfortunately I don’t have the results at hand right now; the code is here.
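
For reference, here is one way to exercise multiple threads from plain Python on a chunked zarr array, reading it chunk by chunk into a preallocated NumPy array. This is only a sketch, assuming "out.zarr" was written as in the benchmark above and is one-dimensional; Blosc generally releases the GIL during decompression, so the threads can actually overlap:

import numpy
import zarr
from concurrent.futures import ThreadPoolExecutor

z = zarr.open("out.zarr", mode="r")        # chunked (and, by default, compressed) array
out = numpy.empty(z.shape, dtype=z.dtype)  # preallocated destination


def read_chunk(start):
    stop = min(start + z.chunks[0], z.shape[0])
    out[start:stop] = z[start:stop]        # each chunk-aligned slice decompresses one chunk


with ThreadPoolExecutor() as pool:
    list(pool.map(read_chunk, range(0, z.shape[0], z.chunks[0])))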

1 reaction
joshmoore commented, Nov 15, 2022

I noticed there is an open issue on zarr+dask as well, so it made me unsure of the maturity of the duo: #962

I’ll update the description of this issue. Here the problem is that someone tried to wrap a dask in a zarr, but you should put a zarr in your dask. 😄
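
For readers who land here: the "zarr in your dask" direction just means building the dask array on top of the zarr store, so that dask schedules the parallel chunk reads. A minimal sketch, assuming dask is installed and "out.zarr" exists as above:

import dask.array as da

arr = da.from_zarr("out.zarr")   # lazy dask array backed by the zarr chunks
result = arr.mean().compute()    # chunks are read (and decompressed) in parallel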

Read more comments on GitHub >
