Caching seems to have no effect on read times
I can’t seem to get any change in performance when slicing along a 1d dataset’s axis by modifying the caching arguments, even though the reads are contiguous. In contrast, reading chunks exactly gives much better performance. Here’s an example:
# Setup
import h5py
import numpy as np
x = np.random.randint(0, 1000, int(1e7))
indices = np.sort(np.random.choice(int(1e7), int(1e4), replace=False))
with h5py.File("test.h5", "w") as f:
    f.create_dataset("x", data=x, chunks=True)
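For reference (this check isn’t in the original post), chunks=True lets h5py pick a chunk shape automatically, and that chunk size is what rdcc_nbytes is measured against; it can be inspected after creation:
import h5py  # assuming the setup above has already run

with h5py.File("test.h5", "r") as f:
    dset = f["x"]
    print(dset.chunks)                           # e.g. (125000,) -- exact shape depends on h5py's heuristic
    print(dset.chunks[0] * dset.dtype.itemsize)  # chunk size in bytes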
Now I’ll try to read the array in a way that should mostly be hitting the cache. I’ll show the general schema of the benchmark once; for brevity, the later cases show just how the file was opened and the resulting timings.
f = h5py.File("test.h5", "r")
dset = f["x"]
# In its own cell:
%%timeit
for i in range(len(indices) - 1):
    s = slice(indices[i], indices[i+1])
    dset[s]
# Timing results:
# 1.21 s ± 6.71 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
f.close()
With a large cache:
f = h5py.File("test.h5", "r", rdcc_nbytes=100 * (1024 ** 2), rdcc_nslots=50000)
# Timing results:
# 1.17 s ± 25.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
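As a sanity check (my addition, not part of the original benchmark), the cache parameters that actually took effect can be read back from the low-level file access property list; if I recall the API correctly, PropFAID.get_cache() returns (mdc_nelmts, rdcc_nslots, rdcc_nbytes, rdcc_w0):
# Sketch: verify the requested chunk-cache settings reached the file access property list
import h5py

f = h5py.File("test.h5", "r", rdcc_nbytes=100 * (1024 ** 2), rdcc_nslots=50000)
print(f.id.get_access_plist().get_cache())  # (mdc_nelmts, rdcc_nslots, rdcc_nbytes, rdcc_w0)
f.close()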
With no cache:
f = h5py.File("test.h5", "r", rdcc_nbytes=0)
# Timing results
# 1.14 s ± 12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Setting rdcc_w0 (the chunk preemption policy) to 0.5, while using a large cache:
f = h5py.File("test.h5", "r", rdcc_nbytes=100 * (1024 ** 2), rdcc_nslots=50000, rdcc_w0=.5)
# Timing results
# 1.16 s ± 37.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
These timings all fluctuate a bit, but are all pretty similar. By contrast, I’ll just read in entire chunks:
%%timeit
cs = dset.chunks[0]
ts = dset.shape[0]
slice_gen = (slice(i*cs, min((i+1)*cs, ts)) for i in range(ts // cs + 1))
for s in slice_gen:
    dset[s]
# Timing results:
# 136 ms ± 955 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Any idea what’s up with this? This was mentioned before on the mailing list, but there didn’t seem to be much resolution.
Version info
h5py 2.10.0
HDF5 1.10.4
Python 3.7.4 (default, Sep 7 2019, 18:27:02)
[Clang 10.0.1 (clang-1001.0.46.4)]
sys.platform darwin
sys.maxsize 9223372036854775807
numpy 1.17.2
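This block is the kind of summary h5py prints itself; it can be reproduced with:
import h5py
print(h5py.version.info)  # prints the h5py / HDF5 / Python / numpy summary shown above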
I think I’m satisfied that there’s nothing specific to tackle here. h5py 3.0 made some performance improvements, and while there may be more room to improve, I don’t think it’s worth turning this issue into a generic ‘better performance’ one. I believe we understand why the HDF5 chunk cache doesn’t seem to make much difference when data isn’t compressed (because the OS is also caching data read from the file).
So I’ll close this. If someone strenuously disagrees, we can reopen it. Or if you identify a specific performance problem, please open a new issue.
Thanks @ivirshup!
I think this is right, as your data is not compressed - once it’s in the OS disk cache, reading it is just one extra memory copy. I found at some point that the HDF5 cache is much more important with compressed data - which makes sense, because the HDF5 cache can store the decompressed data, whereas the disk cache has it compressed.
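To make that concrete, here’s a rough sketch (my own addition, not from the thread) of the comparison this suggests: the same random-slice benchmark against a gzip-compressed copy of the dataset, with the default and with a large chunk cache. With compression, a larger rdcc_nbytes should matter noticeably more, since the chunk cache holds decompressed chunks:
# Sketch: compare chunk-cache sizes on a gzip-compressed dataset (numbers will vary by machine)
import time
import h5py
import numpy as np

x = np.random.randint(0, 1000, int(1e7))
indices = np.sort(np.random.choice(int(1e7), int(1e4), replace=False))

with h5py.File("test_gzip.h5", "w") as f:
    f.create_dataset("x", data=x, chunks=True, compression="gzip")

def bench(**cache_kwargs):
    with h5py.File("test_gzip.h5", "r", **cache_kwargs) as f:
        dset = f["x"]
        t0 = time.perf_counter()
        for i in range(len(indices) - 1):
            dset[indices[i]:indices[i + 1]]
        return time.perf_counter() - t0

print(bench())                                                  # default ~1 MiB chunk cache
print(bench(rdcc_nbytes=100 * (1024 ** 2), rdcc_nslots=50000))  # large cache: decompressed chunks get reused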