Performance compared to PyTables
I recently found both zarr and PyTables (finally, a stable replacement for CSVs…) and was wondering if I'm doing something wrong in my choice of chunk shapes here. My data is roughly a 100000 x 20000 int64 array, fairly compressible, and I need to access it from multiple processes (I'm using PyTorch, which spawns multiple workers). I only really need to access a full row at a time, so I've been setting the chunk size to None on the second dimension.
However, my reads seem to be about 30x slower in zarr than in PyTables, despite using the same compressor/filter (blosc-blosclz). I can't quite reproduce this magnitude of difference with synthetic data, but in the example below zarr is about 8x slower than PyTables.
Am I doing something wrong, or is this expected?
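For context, the access pattern in question might look roughly like the sketch below; the ZarrRowDataset class, the dataset length, and the DataLoader settings are illustrative assumptions and not part of the benchmark script that follows.

# Illustrative sketch of multi-process, row-at-a-time reads via PyTorch workers.
import torch
import zarr
from torch.utils.data import DataLoader, Dataset

class ZarrRowDataset(Dataset):  # hypothetical name, not from the benchmark below
    def __init__(self, path):
        self.path = path
        self.z = None  # open lazily so each worker process gets its own handle

    def __len__(self):
        return 100000

    def __getitem__(self, i):
        if self.z is None:
            self.z = zarr.open(self.path, mode='r')
        return torch.from_numpy(self.z['data'][i])  # one full row per item

loader = DataLoader(ZarrRowDataset('bench.zarr'), batch_size=32, num_workers=4)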
import os
import sys
import numcodecs
from numcodecs import Blosc
import numpy as np
import tables
import zarr
def access(z, n=100000):
    # Read one random row from the zarr array.
    i = np.random.randint(0, n)
    return z.data[i]

def access_tables(t, n=100000):
    # Read one random row from the PyTables EArray.
    i = np.random.randint(0, n)
    return t.root.data[i]

def create_zarr(path, n=100000, shape=(0, 20000), chunks=(100, None)):
    # Build (or reopen) a zarr store with n rows of small random ints,
    # appended one row at a time.
    if os.path.exists(path):
        return zarr.open(path)
    else:
        z = zarr.open(path, 'w')
        compressor = Blosc(cname='blosclz', clevel=7)
        arr = z.create('data', shape=shape, compressor=compressor, chunks=chunks)
        for _ in range(n):
            arr.append(np.random.randint(0, 10, (1, shape[1])))
        return z

def create_table(path, n=100000, shape=(0, 20000)):
    # Build (or reopen) an HDF5 file with an equivalent EArray.
    if os.path.exists(path):
        return tables.open_file(path)
    else:
        t = tables.open_file(path, 'w')
        filters = tables.Filters(7, 'blosc')
        a = tables.Float64Atom()
        arr = t.create_earray(
            t.root, 'data', a, shape, expectedrows=n, filters=filters,
        )
        for _ in range(n):
            arr.append(np.random.randint(0, 10, (1, shape[1])))
        return t
path = 'bench.{}'
z = create_zarr(path.format('zarr'))
t = create_table(path.format('h5'))
print('zarr info:')
print(z.data.info)
print('tables info:')
print(t.root.data)
print(t.root.data.filters)
print('zarr timings:')
%timeit access(z)
print('tables timings:')
%timeit access_tables(t)
print(f'zarr: {zarr.version.version}')
print(f'numcodecs: {numcodecs.version.version}')
print(f'tables: {tables.__version__}')
print(f'python: {sys.version_info}')
print(f'platform: {sys.platform}')
Output:
zarr info:
Name : /data
Type : zarr.core.Array
Data type : float64
Shape : (100000, 20000)
Chunk shape : (100, 20000)
Order : C
Read-only : False
Compressor : Blosc(cname='blosclz', clevel=7, shuffle=SHUFFLE,
: blocksize=0)
Store type : zarr.storage.DirectoryStore
No. bytes : 16000000000 (14.9G)
No. bytes stored : 2219391813 (2.1G)
Storage ratio : 7.2
Chunks initialized : 1000/1000
tables info:
/data (EArray(100000, 20000), shuffle, blosc(7)) ''
Filters(complevel=7, complib='blosc', shuffle=True, bitshuffle=False, fletcher32=False, least_significant_digit=None)
zarr timings:
4.11 ms ± 343 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
tables timings:
560 µs ± 36.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
zarr: 2.2.0
numcodecs: 0.5.5
tables: 3.4.4
python: sys.version_info(major=3, minor=6, micro=6, releaselevel='final', serial=0)
platform: linux
zarr/numcodecs/tables installed using conda.
Ace, thanks for the recommendation. I've settled on something low like 4 for the first-dimension chunk size to strike a bit of a balance, and it's looking good!
And thanks for writing zarr 😃
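Presumably that corresponds to creating the array with a small first-dimension chunk, something like the sketch below; the exact call isn't shown in the thread, so this is an assumption that mirrors the benchmark script above.

# Assumed re-creation of the array with chunks of 4 rows ("something low like 4");
# rows would then be appended exactly as in create_zarr() above.
import zarr
from numcodecs import Blosc

z = zarr.open('bench.zarr', 'w')
compressor = Blosc(cname='blosclz', clevel=7)
arr = z.create('data', shape=(0, 20000), chunks=(4, None), compressor=compressor)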
Cool, no worries. FWIW, if you are reading 1 row at a time then chunks=(1, None) will be fastest. If you can adapt your logic to read one chunk at a time into memory and then iterate over rows within each chunk, you would have the flexibility to use larger chunks, and larger chunks usually give (much) better read speed and compression ratio. Once #306 is in, that will be handled transparently for you, but it's not there yet.
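A rough sketch of that chunk-at-a-time pattern, written out by hand (the process() function is a hypothetical stand-in for whatever per-row work the training loop does):

# Read one chunk of rows into memory, then iterate over the rows inside it.
import zarr

def process(row):
    ...  # hypothetical per-row work

z = zarr.open('bench.zarr', mode='r')
arr = z['data']
chunk_rows = arr.chunks[0]  # rows per chunk along the first dimension

for start in range(0, arr.shape[0], chunk_rows):
    block = arr[start:start + chunk_rows]  # one decompression per chunk
    for row in block:                      # iterate in memory, no extra I/O
        process(row)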