Disable writing dataset timestamps by default?
This is a bit speculative, so I won’t push it if there doesn’t seem to be a consensus.
The option to disable writing timestamps (group.create_dataset(..., track_times=False)) appears to have been there since 2013 (#271), based on a request from 2011 (#225). It’s useful because, in some circumstances, people want running the same code to produce an identical file. There are much bigger challenges for reproducibility, so an option you can simply turn off isn’t the end of the world, but it’s an extra little annoyance for people to deal with (e.g. on Stack Overflow, and issue #1919).
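As a rough sketch of the reproducibility use case, the snippet below writes the same dataset twice with an explicit track_times value and compares the resulting bytes. The helper names (write_file, files_identical) and the file names are made up for illustration; only create_dataset(..., track_times=...) comes from the discussion above.

```python
import os
import tempfile

import h5py
import numpy as np


def write_file(path, track_times):
    """Write a small HDF5 file containing a single dataset."""
    with h5py.File(path, "w") as f:
        # Pass track_times explicitly so the result does not depend on
        # whatever default the installed h5py version uses.
        f.create_dataset("data", data=np.arange(10), track_times=track_times)


def files_identical(track_times):
    """Write the same content twice and report whether the bytes match."""
    with tempfile.TemporaryDirectory() as tmp:
        first = os.path.join(tmp, "first.h5")
        second = os.path.join(tmp, "second.h5")
        write_file(first, track_times)
        write_file(second, track_times)
        with open(first, "rb") as a, open(second, "rb") as b:
            return a.read() == b.read()


if __name__ == "__main__":
    print("track_times=False identical:", files_identical(False))
    print("track_times=True identical:", files_identical(True))
```

I’d expect files_identical(False) to return True. With track_times=True the answer presumably depends on whether the clock moves on between the two writes, so it shouldn’t be relied on either way.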
I just tried to actually inspect some timestamps (to work out how they behave with group.copy()), and it took a while to work out how to do so.
- I can’t find how to make any of the HDF5 command line tools show timestamps (h5ls, h5dump, h5stat).
- The HDF5 C API favours H5Oget_info as the way to do this, but none of the time fields are exposed in h5py (even in the low-level API). The HDF5 docs say that of the four timestamps, only ctime is implemented.
h5py.h5g.get_objinfo().mtime appears to work, although the HDF5 C function is deprecated in favour of H5Oget_info. Only datasets appear to get a timestamp by default, and you’d need the low-level API to enable it for groups, so it’s weird to use an H5G API for datasets.
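For anyone who wants to look at the stored value, here is a minimal sketch of reading it through h5py.h5g.get_objinfo(). The function name dataset_mtime and the scratch file name are hypothetical, and the in-memory core driver is only used to avoid touching the disk.

```python
import h5py
import numpy as np


def dataset_mtime(track_times):
    """Create a dataset in an in-memory file and return its stored mtime."""
    # driver="core" with backing_store=False keeps the file in memory only;
    # the file name is a placeholder and nothing is written to disk.
    with h5py.File("scratch.h5", "w", driver="core", backing_store=False) as f:
        dset = f.create_dataset(
            "data", data=np.arange(10), track_times=track_times
        )
        # get_objinfo wraps the deprecated H5Gget_objinfo C function and
        # takes the low-level object ID, not the high-level Dataset.
        return h5py.h5g.get_objinfo(dset.id).mtime


if __name__ == "__main__":
    print("with track_times=True: ", dataset_mtime(True))
    print("with track_times=False:", dataset_mtime(False))
```

The value with track_times=True should be a Unix timestamp in seconds; with track_times=False I’d expect 0, meaning no time was stored.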
I can’t see any issues asking for us to expose the timestamps on h5py.h5o.ObjInfo, or to expose any kind of timestamp in the high-level API. I also did a couple of searches on the HDF forum, without finding much. I get a strong impression that very few people have any use for HDF5 object timestamps.
If more people want timestamps disabled than want to use timestamps, should we disable writing them by default? It would still be possible to enable them by passing track_times=True if someone does want them.
The author of rhdf5 (R bindings) said on the HDF forum that he made a similar change, released in May 2020, and so far has not had any complaints.
I think this is a convincing datapoint, because usually breaking something is the most reliable way to get feedback. 😉
If even people who want a timestamp are manually writing their own, that does suggest that writing the built-in timestamps is not a useful default.
I’m sympathetic to the argument that h5py should follow HDF5’s default, but when I went and looked at the HDF5 documentation, the default didn’t actually seem to be documented, which implies it’s not super important. Maybe if h5py makes the change, HDF5 will follow suit later.
I can see a range of possible motivations for checking exact equality rather than using h5diff. For one thing, it allows you to store the hash of the known good file rather than the file itself. You may also care about performance characteristics like chunking or where data is in a file. I also just like the idea that running the same code (including libraries) gives you the same output.
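To illustrate the hash-based check, here is a small sketch using only the standard library. The function names (file_sha256, matches_known_good) and the choice of SHA-256 are my own assumptions; any stable digest would do.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_known_good(path, expected_hex):
    """Compare a freshly written file against a stored reference digest."""
    return file_sha256(path) == expected_hex
```

A test suite would then only need to keep the expected digest string, not the reference file. This only works if the written file is byte-for-byte reproducible, which is where embedded timestamps get in the way.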