
Caching seems to have no effect on read times

I can’t seem to get any change in performance when slicing into a 1-d dataset by modifying the caching arguments, even though the reads are contiguous. In contrast, reading exact chunks gives much better performance. Here’s an example:

# Setup
import h5py
import numpy as np

x = np.random.randint(0, 1000, int(1e7))
indices = np.sort(np.random.choice(int(1e7), int(1e4), replace=False))

with h5py.File("test.h5", "w") as f:
    f.create_dataset("x", data=x, chunks=True)
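
For reference, chunks=True lets h5py pick a chunk shape automatically. A minimal check of what it chose, assuming the test.h5 file written above, helps when interpreting the benchmarks below:

with h5py.File("test.h5", "r") as f:
    dset = f["x"]
    print(dset.chunks)       # the auto-selected chunk shape
    print(dset.compression)  # None: the data is stored uncompressed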

Now I’ll try to read the array in a way that should mostly be hitting the cache. I’ll show the general schema of the benchmark once; for brevity, the later variants show just how the file was opened and the timings.

f = h5py.File("test.h5", "r")
dset = f["x"]

# In its own cell:
%%timeit
for i in range(len(indices) - 1):
    s = slice(indices[i], indices[i+1])
    dset[s]
# Timing results:
# 1.21 s ± 6.71 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

f.close()

With a large cache:

f = h5py.File("test.h5", "r", rdcc_nbytes=100 * (1024 ** 2), rdcc_nslots=50000)
# Timing results:
# 1.17 s ± 25.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

With no cache:

f = h5py.File("test.h5", "r", rdcc_nbytes=0)
# Timing results:
# 1.14 s ± 12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Setting rdcc_w0 (here 0.5) while using a large cache:

f = h5py.File("test.h5", "r", rdcc_nbytes=100 * (1024 ** 2), rdcc_nslots=50000, rdcc_w0=.5)
# Timing results:
# 1.16 s ± 37.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

These timings all fluctuate a bit, but are all pretty similar. By contrast, I’ll just read in entire chunks:

%%timeit
cs = dset.chunks[0]  # chunk size along the single axis
ts = dset.shape[0]   # total dataset length
# chunk-aligned slices covering the whole dataset
slice_gen = (slice(i*cs, min((i+1)*cs, ts)) for i in range(ts // cs + 1))
for s in slice_gen:
    dset[s]
# Timing results:
# 136 ms ± 955 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Any idea what’s up with this? This was mentioned before on the mailing list, but there didn’t seem to be much resolution.

Version info

h5py    2.10.0
HDF5    1.10.4
Python  3.7.4 (default, Sep  7 2019, 18:27:02) 
[Clang 10.0.1 (clang-1001.0.46.4)]
sys.platform    darwin
sys.maxsize     9223372036854775807
numpy   1.17.2

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments:23 (15 by maintainers)

Top GitHub Comments

1 reaction
takluyver commented, Nov 24, 2020

I think I’m satisfied that there’s nothing specific to tackle here. 3.0 made some performance improvements, and while there may be more room to improve, I don’t think it’s worth turning this issue into a generic ‘better performance’ one. I believe we understand why the HDF5 chunk cache doesn’t seem to make much difference when data isn’t compressed (because the OS is also caching data read from file).

So I’ll close this. If someone strenuously disagrees, we can reopen it. Or if you identify a specific performance problem, please open a new issue.
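
One rough way to see the OS page cache at work, assuming the test.h5 file from the setup above, is to time a raw read of the whole file twice. The second pass is typically served from the OS page cache, which is why an extra HDF5-level cache buys little for uncompressed data:

import time

def time_raw_read(path="test.h5"):
    # Read the entire file through the OS, bypassing HDF5 entirely.
    start = time.perf_counter()
    with open(path, "rb") as fh:
        fh.read()
    return time.perf_counter() - start

print("first read :", time_raw_read())   # may actually have to hit the disk
print("second read:", time_raw_read())   # usually served from the OS page cache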

0 reactions
takluyver commented, Oct 14, 2020

Thanks @ivirshup !

> A little surprisingly, I’m only seeing small changes in read speeds from setting rdcc_nbytes=0 (though I am seeing some effect). My guess is that this might have to do with hitting the OS cache instead of the hdf5 one.

I think this is right, as your data is not compressed - once it’s in the OS disk cache, reading it is just one extra memory copy. I found at some point that the HDF5 cache is much more important with compressed data - which makes sense, because the HDF5 cache can store the decompressed data, whereas the disk cache has it compressed.
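
A minimal sketch of that kind of comparison, assuming the x and indices arrays from the setup above and a hypothetical gzip-compressed copy of the dataset: with compression, the chunk cache holds decompressed chunks, so disabling it should slow the small-slice reads down far more noticeably than it did for the uncompressed file.

import time
import h5py

# Hypothetical compressed copy of the same data (gzip is a built-in HDF5 filter)
with h5py.File("test_gzip.h5", "w") as f:
    f.create_dataset("x", data=x, chunks=True, compression="gzip")

def bench_slices(rdcc_nbytes):
    with h5py.File("test_gzip.h5", "r", rdcc_nbytes=rdcc_nbytes) as f:
        dset = f["x"]
        start = time.perf_counter()
        for i in range(len(indices) - 1):
            dset[indices[i]:indices[i + 1]]
        return time.perf_counter() - start

print("default chunk cache:", bench_slices(1024 ** 2))  # HDF5's default cache size is 1 MiB
print("chunk cache disabled:", bench_slices(0))         # each read re-decompresses its chunk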
