Is there any variable information stored in the h5 internal structure that makes the h5 file different on each generation?

I am writing fixed content to an h5 file, but when I read the file's bytes back, the checksum produced is different on each run:

import numpy as np
import h5py
import hashlib

# Fixed, deterministic content: a 100x100 array of int8 ones.
d = np.ones((100, 100), dtype=np.int8)

# Write the array to an HDF5 file.
with h5py.File('data.h5', 'w') as hf:
    hf.create_dataset('dataset', data=d)

# Hash the raw bytes of the resulting file.
with open('data.h5', "rb") as f:
    digest = hashlib.md5(f.read()).hexdigest()
print(digest)

Every run of this script produces a different digest:

In [1]: import numpy as np
   ...: import h5py
   ...: import hashlib
   ...:
   ...: d = np.ones((100,100), dtype=np.int8)
   ...:
   ...: with h5py.File('data.h5', 'w') as hf:
   ...:     hf.create_dataset('dataset', data=d)
   ...:
   ...:
   ...: with open('data.h5', "rb") as f:
   ...:     digest = hashlib.md5(f.read()).hexdigest()
   ...: print(digest)
b1d4035a06358719d48cf89684f681ab

In [2]: import numpy as np
   ...: import h5py
   ...: import hashlib
   ...:
   ...: d = np.ones((100,100), dtype=np.int8)
   ...:
   ...: with h5py.File('data.h5', 'w') as hf:
   ...:     hf.create_dataset('dataset', data=d)
   ...:
   ...:
   ...: with open('data.h5', "rb") as f:
   ...:     digest = hashlib.md5(f.read()).hexdigest()
   ...: print(digest)
d97372937056b8a2b5628dba6c770e7a

In [3]: import numpy as np
   ...: import h5py
   ...: import hashlib
   ...:
   ...: d = np.ones((100,100), dtype=np.int8)
   ...:
   ...: with h5py.File('data.h5', 'w') as hf:
   ...:     hf.create_dataset('dataset', data=d)
   ...:
   ...:
   ...: with open('data.h5', "rb") as f:
   ...:     digest = hashlib.md5(f.read()).hexdigest()
   ...: print(digest)
ef3258302ab433cbfbd8cfd1d6d9f473
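
One way to confirm that the stored data are identical even though the raw bytes differ is to hash the dataset contents after reading them back, rather than the file itself. This is a minimal sketch added for illustration, not part of the original report:

import numpy as np
import h5py
import hashlib

# Hash the dataset's contents instead of the container's raw bytes.
# The array bytes are deterministic, so this digest is stable across
# runs even while the whole-file digest changes.
with h5py.File('data.h5', 'r') as hf:
    data = hf['dataset'][()]  # read the full array back into memory
digest = hashlib.md5(data.tobytes()).hexdigest()
print(digest)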

My understanding, without much detailed knowledge of h5 internals, is that I should get the same digest as long as the file is produced with the same drivers and runtime. This holds true for plain files; see below:

import hashlib
with open('data.txt', 'w') as f:
    f.write('readme')

with open('data.txt', "rb") as f:
    digest = hashlib.md5(f.read()).hexdigest()
print(digest)

Running this multiple times produces the same digest:

In [2]:
   ...: with open('data.txt', 'w') as f:
   ...:     f.write('readme')
   ...:
   ...: with open('data.txt', "rb") as f:
   ...:     digest = hashlib.md5(f.read()).hexdigest()
   ...: print(digest)
3905d7917f2b3429490b01cfb60d8f5b

In [3]:
   ...: with open('data.txt', 'w') as f:
   ...:     f.write('readme')
   ...:
   ...: with open('data.txt', "rb") as f:
   ...:     digest = hashlib.md5(f.read()).hexdigest()
   ...: print(digest)
3905d7917f2b3429490b01cfb60d8f5b

In [4]:
   ...: with open('data.txt', 'w') as f:
   ...:     f.write('readme')
   ...:
   ...: with open('data.txt', "rb") as f:
   ...:     digest = hashlib.md5(f.read()).hexdigest()
   ...: print(digest)
3905d7917f2b3429490b01cfb60d8f5b

Am I mistaken about the expected behavior of h5? Is this a bug? If not, is there a way to make the output reproducible when the contents written are exactly the same?

My system info is as follows:

python -c 'import h5py; print(h5py.version.info)'
Summary of the h5py configuration
---------------------------------

h5py    2.10.0
HDF5    1.10.6
Python  3.8.8 | packaged by conda-forge | (default, Feb 20 2021, 16:12:38)
[Clang 11.0.1 ]
sys.platform    darwin
sys.maxsize     9223372036854775807
numpy   1.20.3

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

1 reaction
suneeta-mall commented, Jul 5, 2021

Ah, thanks for that @takluyver! That worked! I agree it should be easier to do this; I might give it a crack in the coming days, so I'm leaving this open.
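
The suggestion being thanked here is not quoted on this page. A likely candidate, stated as an assumption rather than a confirmed fix, is h5py's track_times option: by default HDF5 stores creation/modification timestamps in object headers, which makes otherwise identical files differ byte-for-byte. A minimal sketch:

import numpy as np
import h5py
import hashlib

d = np.ones((100, 100), dtype=np.int8)

# track_times=False asks HDF5 not to store timestamps in the object
# header, so repeated writes of the same data can produce
# byte-identical files.
with h5py.File('data.h5', 'w') as hf:
    hf.create_dataset('dataset', data=d, track_times=False)

with open('data.h5', 'rb') as f:
    print(hashlib.md5(f.read()).hexdigest())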

0 reactions
suneeta-mall commented, Jul 8, 2021

@tacaswell The platform issue I was talking about was happening with different content than the simple numpy array in the example above. track_order had no impact on making the checksum platform-agnostic.

I suspect your hunch may be right and it could be related to endianness. I would be interested in knowing the details, but this is not a major problem for me for now. I will raise it with the HDF5 group shortly.

I will close this in the meantime. Thanks to both @takluyver and @tacaswell.
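
If endianness is indeed the culprit for the cross-platform differences, one possible workaround, untested here and offered only as a sketch, is to pin the dataset to an explicit byte order when writing, so every platform stores the same on-disk representation:

import numpy as np
import h5py

d = np.ones((100, 100), dtype=np.int32)

with h5py.File('data.h5', 'w') as hf:
    # '<i4' pins the dataset to little-endian 4-byte integers regardless
    # of the host's native byte order. (int8 has no byte order, so a
    # multi-byte dtype is used for illustration.)
    hf.create_dataset('dataset', data=d, dtype='<i4', track_times=False)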
