Is there any variable information stored in the h5 internal structure that makes the h5 file different on each generation?

I am writing fixed content to an h5 file, but when I read the file's bytes back, the checksum produced is different on each run:

import numpy as np
import h5py
import hashlib

# Fixed, deterministic content: a 100x100 array of int8 ones.
d = np.ones((100, 100), dtype=np.int8)

# Write the array to an HDF5 file.
with h5py.File('data.h5', 'w') as hf:
    hf.create_dataset('dataset', data=d)

# Hash the raw bytes of the resulting file.
with open('data.h5', "rb") as f:
    digest = hashlib.md5(f.read()).hexdigest()
print(digest)

Every run of this script produces a different digest:

In [1]: import numpy as np
   ...: import h5py
   ...: import hashlib
   ...:
   ...: d = np.ones((100,100), dtype=np.int8)
   ...:
   ...: with h5py.File('data.h5', 'w') as hf:
   ...:     hf.create_dataset('dataset', data=d)
   ...:
   ...:
   ...: with open('data.h5', "rb") as f:
   ...:     digest = hashlib.md5(f.read()).hexdigest()
   ...: print(digest)
b1d4035a06358719d48cf89684f681ab

In [2]: import numpy as np
   ...: import h5py
   ...: import hashlib
   ...:
   ...: d = np.ones((100,100), dtype=np.int8)
   ...:
   ...: with h5py.File('data.h5', 'w') as hf:
   ...:     hf.create_dataset('dataset', data=d)
   ...:
   ...:
   ...: with open('data.h5', "rb") as f:
   ...:     digest = hashlib.md5(f.read()).hexdigest()
   ...: print(digest)
d97372937056b8a2b5628dba6c770e7a

In [3]: import numpy as np
   ...: import h5py
   ...: import hashlib
   ...:
   ...: d = np.ones((100,100), dtype=np.int8)
   ...:
   ...: with h5py.File('data.h5', 'w') as hf:
   ...:     hf.create_dataset('dataset', data=d)
   ...:
   ...:
   ...: with open('data.h5', "rb") as f:
   ...:     digest = hashlib.md5(f.read()).hexdigest()
   ...: print(digest)
ef3258302ab433cbfbd8cfd1d6d9f473
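
One way to confirm that the stored data are identical even though the raw bytes differ is to hash the dataset contents after reading them back, rather than the file itself. This is a minimal sketch added for illustration, not part of the original report:

import numpy as np
import h5py
import hashlib

# Hash the dataset's contents instead of the container's raw bytes.
# The array bytes are deterministic, so this digest is stable across
# runs even while the whole-file digest changes.
with h5py.File('data.h5', 'r') as hf:
    data = hf['dataset'][()]  # read the full array back into memory
digest = hashlib.md5(data.tobytes()).hexdigest()
print(digest)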

My understanding, without much detailed knowledge of h5 internals, is that I should get the same digest as long as the file is produced with the same drivers and runtime. This holds true for plain files; see below:

import hashlib
with open('data.txt', 'w') as f:
    f.write('readme')

with open('data.txt', "rb") as f:
    digest = hashlib.md5(f.read()).hexdigest()
print(digest)

Running this multiple times produces the same digest:

In [2]:
   ...: with open('data.txt', 'w') as f:
   ...:     f.write('readme')
   ...:
   ...: with open('data.txt', "rb") as f:
   ...:     digest = hashlib.md5(f.read()).hexdigest()
   ...: print(digest)
3905d7917f2b3429490b01cfb60d8f5b

In [3]:
   ...: with open('data.txt', 'w') as f:
   ...:     f.write('readme')
   ...:
   ...: with open('data.txt', "rb") as f:
   ...:     digest = hashlib.md5(f.read()).hexdigest()
   ...: print(digest)
3905d7917f2b3429490b01cfb60d8f5b

In [4]:
   ...: with open('data.txt', 'w') as f:
   ...:     f.write('readme')
   ...:
   ...: with open('data.txt', "rb") as f:
   ...:     digest = hashlib.md5(f.read()).hexdigest()
   ...: print(digest)
3905d7917f2b3429490b01cfb60d8f5b

Am I mistaken about the expected behavior of h5? Is this a bug? If not, is there a way to make the output reproducible when the contents written are exactly the same?

My system info is as follows:

python -c 'import h5py; print(h5py.version.info)'
Summary of the h5py configuration
---------------------------------

h5py    2.10.0
HDF5    1.10.6
Python  3.8.8 | packaged by conda-forge | (default, Feb 20 2021, 16:12:38)
[Clang 11.0.1 ]
sys.platform    darwin
sys.maxsize     9223372036854775807
numpy   1.20.3

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

1 reaction
suneeta-mall commented, Jul 5, 2021

Ah, thanks for that @takluyver! That worked! I agree it should be easier to do this; I might give it a crack in the coming days, so I'm leaving this open.
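
The suggestion being thanked here is not quoted on this page. A likely candidate, stated as an assumption rather than a confirmed fix, is h5py's track_times option: by default HDF5 stores creation/modification timestamps in object headers, which makes otherwise identical files differ byte-for-byte. A minimal sketch:

import numpy as np
import h5py
import hashlib

d = np.ones((100, 100), dtype=np.int8)

# track_times=False asks HDF5 not to store timestamps in the object
# header, so repeated writes of the same data can produce
# byte-identical files.
with h5py.File('data.h5', 'w') as hf:
    hf.create_dataset('dataset', data=d, track_times=False)

with open('data.h5', 'rb') as f:
    print(hashlib.md5(f.read()).hexdigest())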

0 reactions
suneeta-mall commented, Jul 8, 2021

@tacaswell The platform issue I was talking about was happening with different content than the simple numpy array in the example above. track_order had no impact on making the checksum platform-agnostic.

I suspect your hunch may be right and it could be related to endianness. I would be interested in knowing the details, but this is not a major problem for me for now. I will raise it with the HDF5 group shortly.

I will close this in the meantime. Thanks to both @takluyver and @tacaswell.
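
If endianness is indeed the culprit for the cross-platform differences, one possible workaround, untested here and offered only as a sketch, is to pin the dataset to an explicit byte order when writing, so every platform stores the same on-disk representation:

import numpy as np
import h5py

d = np.ones((100, 100), dtype=np.int32)

with h5py.File('data.h5', 'w') as hf:
    # '<i4' pins the dataset to little-endian 4-byte integers regardless
    # of the host's native byte order. (int8 has no byte order, so a
    # multi-byte dtype is used for illustration.)
    hf.create_dataset('dataset', data=d, dtype='<i4', track_times=False)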
