Is there any variable information stored in the h5 internal structure that makes the h5 file different on each generation?
I am writing fixed content to an h5 file, but when I read the file's bytes back, the checksum is different on each run:
import numpy as np
import h5py
import hashlib

d = np.ones((100, 100), dtype=np.int8)

with h5py.File('data.h5', 'w') as hf:
    hf.create_dataset('dataset', data=d)

with open('data.h5', "rb") as f:
    digest = hashlib.md5(f.read()).hexdigest()
    print(digest)
Every run of this script produces a different digest:
In [1]: import numpy as np
   ...: import h5py
   ...: import hashlib
   ...:
   ...: d = np.ones((100, 100), dtype=np.int8)
   ...:
   ...: with h5py.File('data.h5', 'w') as hf:
   ...:     hf.create_dataset('dataset', data=d)
   ...:
   ...: with open('data.h5', "rb") as f:
   ...:     digest = hashlib.md5(f.read()).hexdigest()
   ...:     print(digest)
b1d4035a06358719d48cf89684f681ab

In [2]: import numpy as np
   ...: import h5py
   ...: import hashlib
   ...:
   ...: d = np.ones((100, 100), dtype=np.int8)
   ...:
   ...: with h5py.File('data.h5', 'w') as hf:
   ...:     hf.create_dataset('dataset', data=d)
   ...:
   ...: with open('data.h5', "rb") as f:
   ...:     digest = hashlib.md5(f.read()).hexdigest()
   ...:     print(digest)
d97372937056b8a2b5628dba6c770e7a

In [3]: import numpy as np
   ...: import h5py
   ...: import hashlib
   ...:
   ...: d = np.ones((100, 100), dtype=np.int8)
   ...:
   ...: with h5py.File('data.h5', 'w') as hf:
   ...:     hf.create_dataset('dataset', data=d)
   ...:
   ...: with open('data.h5', "rb") as f:
   ...:     digest = hashlib.md5(f.read()).hexdigest()
   ...:     print(digest)
ef3258302ab433cbfbd8cfd1d6d9f473
My understanding, without detailed knowledge of HDF5 internals, is that I should get the same digest as long as the file is produced with the same drivers and runtime. This holds true for simple files, see below:
import hashlib

with open('data.txt', 'w') as f:
    f.write('readme')

with open('data.txt', "rb") as f:
    digest = hashlib.md5(f.read()).hexdigest()
    print(digest)
Running this multiple times produces the same digest:
In [2]: with open('data.txt', 'w') as f:
   ...:     f.write('readme')
   ...:
   ...: with open('data.txt', "rb") as f:
   ...:     digest = hashlib.md5(f.read()).hexdigest()
   ...:     print(digest)
3905d7917f2b3429490b01cfb60d8f5b

In [3]: with open('data.txt', 'w') as f:
   ...:     f.write('readme')
   ...:
   ...: with open('data.txt', "rb") as f:
   ...:     digest = hashlib.md5(f.read()).hexdigest()
   ...:     print(digest)
3905d7917f2b3429490b01cfb60d8f5b

In [4]: with open('data.txt', 'w') as f:
   ...:     f.write('readme')
   ...:
   ...: with open('data.txt', "rb") as f:
   ...:     digest = hashlib.md5(f.read()).hexdigest()
   ...:     print(digest)
3905d7917f2b3429490b01cfb60d8f5b
Am I mistaken about the expected behavior of HDF5? Is this a bug? If not, is there a way to make the file reproducible when the contents written are exactly the same?
My system info is as follows:
python -c 'import h5py; print(h5py.version.info)'
Summary of the h5py configuration
---------------------------------
h5py 2.10.0
HDF5 1.10.6
Python 3.8.8 | packaged by conda-forge | (default, Feb 20 2021, 16:12:38)
[Clang 11.0.1 ]
sys.platform darwin
sys.maxsize 9223372036854775807
numpy 1.20.3
Ah, thanks for that @takluyver, that worked! I agree it should be easier to do this; I might give it a crack in the coming days, so I'm leaving this open.
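For reference, the exact suggestion is not quoted in this excerpt, but one documented way to get byte-identical output from identical h5py writes is to disable the object-creation timestamps that HDF5 records in object headers, via the track_times option of create_dataset. A minimal sketch under that assumption (it addresses the per-dataset timestamps, which are the usual source of run-to-run variation, though other metadata could in principle still differ):

import hashlib

import h5py
import numpy as np

d = np.ones((100, 100), dtype=np.int8)

# track_times=False asks HDF5 not to store creation timestamps in this
# dataset's object header, so repeated runs write the same bytes here.
with h5py.File('data.h5', 'w') as hf:
    hf.create_dataset('dataset', data=d, track_times=False)

with open('data.h5', 'rb') as f:
    print(hashlib.md5(f.read()).hexdigest())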
@tacaswell The platform issue I was talking about was happening on different content than the simple numpy array in the example above. track_order had no impact on making the checksum platform agnostic. I suspect your hunch may be right and it could be related to endianness. I would be interested in knowing the details, but this is not a major problem for me for now. I will raise it with the HDF Group shortly.
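If the goal is a platform-independent checksum rather than identical files on disk, a hedged alternative is to hash the decoded dataset values in a canonical in-memory form (fixed byte order, C-contiguous layout) instead of the raw file bytes, which sidesteps both file-layout and endianness differences. A minimal sketch; the helper name dataset_digest is just for illustration:

import hashlib

import h5py
import numpy as np

def dataset_digest(path, name):
    # Read the dataset into memory, then normalize to little-endian,
    # C-contiguous bytes so the digest does not depend on the HDF5
    # file layout or the native byte order of the writing machine.
    with h5py.File(path, 'r') as hf:
        arr = hf[name][...]
    canonical = np.ascontiguousarray(arr).astype(arr.dtype.newbyteorder('<'))
    return hashlib.md5(canonical.tobytes()).hexdigest()

print(dataset_digest('data.h5', 'dataset'))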
I will close this in the meantime. Thanks to both of you, @takluyver & @tacaswell.