Performance compared to PyTables

I recently found both zarr and PyTables (finally, a stable replacement for CSVs…) and was wondering if I’m doing something wrong in my choice of chunk shape here. My data is roughly a 100000 x 20000 int64 array, fairly compressible, and I need to access it from multiple processes (I’m using PyTorch, which spawns multiple worker processes). I only ever need to read a full row at a time, so I’ve been setting the chunk size to None along the second dimension.
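To make the chunk layout concrete, here’s a minimal sketch of what I mean (the store path is made up and the values are just illustrative, not my real pipeline):

import numpy as np
import zarr

# chunks=(100, None) expands to (100, 20000): each chunk holds 100 full rows.
z = zarr.open('layout_demo.zarr', mode='w', shape=(100000, 20000),
              chunks=(100, None), dtype='i8')
z[0, :] = np.random.randint(0, 10, 20000)   # write one full row
row = z[0, :]                               # a single-row read touches one chunk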

However, my reads appear to be about 30x slower in zarr than in PyTables, despite using the same compressor/filter (blosc with blosclz). I can’t quite reproduce that magnitude of difference with synthetic data, but in the example below zarr is still about 8x slower than PyTables.

Am I doing something wrong, or is this expected?

import os
import sys

import numcodecs
from numcodecs import Blosc
import numpy as np
import tables
import zarr


def access(z, n=100000):
    # Read one random full row from the zarr array.
    i = np.random.randint(0, n)
    return z.data[i]


def access_tables(t, n=100000):
    # Read one random full row from the PyTables EArray.
    i = np.random.randint(0, n)
    return t.root.data[i]


def create_zarr(path, n=100000, shape=(0, 20000), chunks=(100, None)):
    # Reuse an existing store if present, otherwise build it one row at a time.
    if os.path.exists(path):
        return zarr.open(path)
    else:
        z = zarr.open(path, 'w')
    compressor = Blosc(cname='blosclz', clevel=7)
    arr = z.create('data', shape=shape, compressor=compressor, chunks=chunks)
    for _ in range(n):
        arr.append(np.random.randint(0, 10, (1, shape[1])))
    return z


def create_table(path, n=100000, shape=(0, 20000)):
    # Reuse an existing file if present, otherwise build it one row at a time.
    if os.path.exists(path):
        return tables.open_file(path)
    else:
        t = tables.open_file(path, 'w')
    filters = tables.Filters(7, 'blosc')  # renamed to avoid shadowing the builtin filter()
    a = tables.Float64Atom()
    arr = t.create_earray(
        t.root, 'data', a, shape, expectedrows=n, filters=filters,
    )
    for _ in range(n):
        arr.append(np.random.randint(0, 10, (1, shape[1])))
    return t


path = 'bench.{}'
z = create_zarr(path.format('zarr'))
t = create_table(path.format('h5'))

print('zarr info:')
print(z.data.info)
print('tables info:')
print(t.root.data)
print(t.root.data.filters)

print('zarr timings:')
# %timeit is an IPython magic, so this snippet is meant to be run in IPython/Jupyter
%timeit access(z)
print('tables timings:')
%timeit access_tables(t)

print(f'zarr: {zarr.version.version}')
print(f'numcodecs: {numcodecs.version.version}')
print(f'tables: {tables.__version__}')
print(f'python: {sys.version_info}')
print(f'platform: {sys.platform}')

Output:

zarr info:
Name               : /data
Type               : zarr.core.Array
Data type          : float64
Shape              : (100000, 20000)
Chunk shape        : (100, 20000)
Order              : C
Read-only          : False
Compressor         : Blosc(cname='blosclz', clevel=7, shuffle=SHUFFLE,
                   : blocksize=0)
Store type         : zarr.storage.DirectoryStore
No. bytes          : 16000000000 (14.9G)
No. bytes stored   : 2219391813 (2.1G)
Storage ratio      : 7.2
Chunks initialized : 1000/1000

tables info:
/data (EArray(100000, 20000), shuffle, blosc(7)) ''
Filters(complevel=7, complib='blosc', shuffle=True, bitshuffle=False, fletcher32=False, least_significant_digit=None)
zarr timings:
4.11 ms ± 343 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
tables timings:
560 µs ± 36.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
zarr: 2.2.0
numcodecs: 0.5.5
tables: 3.4.4
python: sys.version_info(major=3, minor=6, micro=6, releaselevel='final', serial=0)
platform: linux

zarr/numcodecs/tables installed using conda.

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

sd2k commented on Nov 14, 2018 (1 reaction)

Ace, thanks for the recommendation. I’ve settled for something low like 4 to strike a bit of a balance and it’s looking good!

And thanks for writing zarr 😃

alimanfoo commented on Nov 14, 2018

Cool, no worries. FWIW, if you are reading 1 row at a time then chunks=(1, None) will be fastest. If you can adapt your logic to read one chunk at a time into memory and then iterate over rows within each chunk, you’d have the flexibility to use larger chunks, and larger chunks usually give (much) better read speed and compression ratio. Once #306 is in, that will be handled transparently for you, but it’s not there yet.
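For illustration, a minimal sketch of that chunk-at-a-time reading pattern, assuming the bench.zarr store built by the snippet in the question (process() is a hypothetical placeholder for per-row work):

import zarr

z = zarr.open('bench.zarr', mode='r')            # group created by create_zarr above
arr = z['data']                                  # chunks are (100, 20000) per the info output

rows_per_chunk = arr.chunks[0]
for start in range(0, arr.shape[0], rows_per_chunk):
    block = arr[start:start + rows_per_chunk]    # one chunk read/decompression per loop
    for row in block:                            # rows are then iterated in memory
        process(row)                             # hypothetical per-row work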
