Caching seems to have no effect on read times
I can’t seem to get any change in performance when slicing along a 1d dataset’s axis by modifying the caching arguments, even though the reads are contiguous. In contrast, reading chunks exactly gives much better performance. Here’s an example:
# Setup
import h5py
import numpy as np
x = np.random.randint(0, 1000, int(1e7))
indices = np.sort(np.random.choice(int(1e7), int(1e4), replace=False))
with h5py.File("test.h5", "w") as f:
    f.create_dataset("x", data=x, chunks=True)
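For reference (this check isn’t in the original post), chunks=True lets h5py pick a chunk shape automatically, and that chunk size is what rdcc_nbytes is measured against; it can be inspected after creation:
import h5py  # assuming the setup above has already run

with h5py.File("test.h5", "r") as f:
    dset = f["x"]
    print(dset.chunks)                           # e.g. (125000,) -- exact shape depends on h5py's heuristic
    print(dset.chunks[0] * dset.dtype.itemsize)  # chunk size in bytes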
Now I’ll try to read the array in a way that should mostly be hitting the cache. I’ll show the general schema of the benchmark once; for brevity, the later cases show just how the file was opened and the resulting timings.
f = h5py.File("test.h5", "r")
dset = f["x"]
# In its own cell:
%%timeit
for i in range(len(indices) - 1):
    s = slice(indices[i], indices[i+1])
    dset[s]
# Timing results:
# 1.21 s ± 6.71 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
f.close()
With a large cache:
f = h5py.File("test.h5", "r", rdcc_nbytes=100 * (1024 ** 2), rdcc_nslots=50000)
# Timing results:
# 1.17 s ± 25.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
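As a sanity check (my addition, not part of the original benchmark), the cache parameters that actually took effect can be read back from the low-level file access property list; if I recall the API correctly, PropFAID.get_cache() returns (mdc_nelmts, rdcc_nslots, rdcc_nbytes, rdcc_w0):
# Sketch: verify the requested chunk-cache settings reached the file access property list
import h5py

f = h5py.File("test.h5", "r", rdcc_nbytes=100 * (1024 ** 2), rdcc_nslots=50000)
print(f.id.get_access_plist().get_cache())  # (mdc_nelmts, rdcc_nslots, rdcc_nbytes, rdcc_w0)
f.close()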
With no cache:
f = h5py.File("test.h5", "r", rdcc_nbytes=0)
# Timing results
# 1.14 s ± 12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Setting rdcc_w0 (the chunk preemption policy) to 0.5, while using a large cache:
f = h5py.File("test.h5", "r", rdcc_nbytes=100 * (1024 ** 2), rdcc_nslots=50000, rdcc_w0=.5)
# Timing results
# 1.16 s ± 37.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
These timings all fluctuate a bit, but are all pretty similar. By contrast, I’ll just read in entire chunks:
%%timeit
cs = dset.chunks[0]
ts = dset.shape[0]
slice_gen = (slice(i*cs, min((i+1)*cs, ts)) for i in range(ts // cs + 1))
for s in slice_gen:
    dset[s]
# Timing results:
# 136 ms ± 955 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Any idea what’s up with this? This was mentioned before on the mailing list, but there didn’t seem to be much resolution.
Version info
h5py 2.10.0
HDF5 1.10.4
Python 3.7.4 (default, Sep 7 2019, 18:27:02)
[Clang 10.0.1 (clang-1001.0.46.4)]
sys.platform darwin
sys.maxsize 9223372036854775807
numpy 1.17.2
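This block is the kind of summary h5py prints itself; it can be reproduced with:
import h5py
print(h5py.version.info)  # prints the h5py / HDF5 / Python / numpy summary shown above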
I think I’m satisfied that there’s nothing specific to tackle here. h5py 3.0 made some performance improvements, and while there may be more room to improve, I don’t think it’s worth turning this issue into a generic ‘better performance’ one. I believe we understand why the HDF5 chunk cache doesn’t seem to make much difference when data isn’t compressed (because the OS is also caching data read from the file).
So I’ll close this. If someone strenuously disagrees, we can reopen it. Or if you identify a specific performance problem, please open a new issue.
Thanks @ivirshup!
I think this is right, as your data is not compressed - once it’s in the OS disk cache, reading it is just one extra memory copy. I found at some point that the HDF5 cache is much more important with compressed data - which makes sense, because the HDF5 cache can store the decompressed data, whereas the disk cache has it compressed.
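To make that concrete, here’s a rough sketch (my own addition, not from the thread) of the comparison this suggests: the same random-slice benchmark against a gzip-compressed copy of the dataset, with the default and with a large chunk cache. With compression, a larger rdcc_nbytes should matter noticeably more, since the chunk cache holds decompressed chunks:
# Sketch: compare chunk-cache sizes on a gzip-compressed dataset (numbers will vary by machine)
import time
import h5py
import numpy as np

x = np.random.randint(0, 1000, int(1e7))
indices = np.sort(np.random.choice(int(1e7), int(1e4), replace=False))

with h5py.File("test_gzip.h5", "w") as f:
    f.create_dataset("x", data=x, chunks=True, compression="gzip")

def bench(**cache_kwargs):
    with h5py.File("test_gzip.h5", "r", **cache_kwargs) as f:
        dset = f["x"]
        t0 = time.perf_counter()
        for i in range(len(indices) - 1):
            dset[indices[i]:indices[i + 1]]
        return time.perf_counter() - t0

print(bench())                                                  # default ~1 MiB chunk cache
print(bench(rdcc_nbytes=100 * (1024 ** 2), rdcc_nslots=50000))  # large cache: decompressed chunks get reused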