Default cast type of H5T_NATIVE_B8 bitfields (bool, int8, uint8)
As briefly discussed in #821, reading datasets which contain bitfields (H5T_NATIVE_B8) is not possible unless you explicitly state the dtype:
import h5py

h5_file = h5py.File('BigDataFile.h5', 'r')
dset = h5_file.get('/Base/GroupA')
# '?' is the NumPy bool code, used here for the bitfield member
dt = [('Time', '<f8'),
      ('SubsetA', [('DataA1', '?'), ('DataA2', '<f8')])]
with dset.astype(dt):
    data = dset[:]
whereas datasets without bitfields can be accessed as:
import h5py
h5_file = h5py.File('BigDataFile.h5', 'r')
dset = h5_file.get('/Base/GroupA')
data = dset[:]
Attempting to read a compound dataset in this way produces an error if it has at least one bitfield:
TypeError: No NumPy equivalent for TypeBitfieldID exists
Some of the suggestions from #821 were to automatically convert to bool, uint8 or int8. Imho it should convert to something that allows reading without explicitly stating the datatype, so that most users can use the second code snippet.
If there were going to be a default, uint8 or int8 would achieve this and, unlike bool, would not lose any data, although the original motivation for using bool in the pull request was PyTables interop. I'd suggest uint8: it would work out of the box without requiring explicit casts, lose no data, and anyone who wants something else could still do an explicit cast.
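Until there is such a default, the explicit cast can at least be automated: build the in-memory dtype from the file's own compound type and substitute uint8 only for the bitfield members. This is a hedged sketch using h5py's low-level type API; the helper name is mine, and it assumes 8-bit bitfields in a compound dataset like the one above (wider bitfields would need wider uints):

```python
import numpy as np
import h5py

def dtype_with_uint8_bitfields(tid):
    """Build a NumPy dtype for an HDF5 compound type, substituting uint8
    for bitfield members (helper name and the uint8 choice are assumptions)."""
    fields = []
    for i in range(tid.get_nmembers()):
        name = tid.get_member_name(i)
        name = name.decode() if isinstance(name, bytes) else name
        cls = tid.get_member_class(i)
        if cls == h5py.h5t.COMPOUND:
            # recurse into nested compounds such as 'SubsetA'
            fields.append((name, dtype_with_uint8_bitfields(tid.get_member_type(i))))
        elif cls == h5py.h5t.BITFIELD:
            # an 8-bit bitfield fits in uint8; wider bitfields would need wider uints
            fields.append((name, np.uint8))
        else:
            # .dtype works for members that already have a NumPy equivalent
            fields.append((name, tid.get_member_type(i).dtype))
    return np.dtype(fields)

h5_file = h5py.File('BigDataFile.h5', 'r')
dset = h5_file.get('/Base/GroupA')
dt = dtype_with_uint8_bitfields(dset.id.get_type())
with dset.astype(dt):   # same context-manager form as the snippet above
    data = dset[:]
```

This is essentially doing by hand what a built-in bitfield-to-uint mapping would do automatically, without each user having to spell out the whole compound dtype.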
All comments are welcome 😃 I may be missing some understanding of h5py, so feel free to correct if there’s anything wrong!
I’d say if we’re going to do anything by default, uint8 is the obvious choice. If we do that, I’d also map the other bitfield types to their corresponding uint types for consistency.
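For concreteness, the "corresponding uint types" mapping would presumably pair widths like this (the dict is purely illustrative, not an existing h5py table):

```python
import numpy as np

# Illustrative only: what mapping each bitfield width to the matching uint could mean.
PROPOSED_BITFIELD_MAP = {
    8:  np.uint8,   # H5T_NATIVE_B8
    16: np.uint16,  # H5T_NATIVE_B16
    32: np.uint32,  # H5T_NATIVE_B32
    64: np.uint64,  # H5T_NATIVE_B64
}
```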
Do we lose anything by mapping two HDF5 types to the same numpy dtype? Is there a convenient way for the user to check the HDF5 type of a dataset if dset.dtype is ambiguous? What if they want to create a dataset with a bitfield type?

Another idea: what if there were a context manager to modify the type mappings, so that instead of having to define the entire compound dtype for a dataset, you could do something like:
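Purely as a hypothetical illustration of that idea (neither h5py.type_mapping nor its mapping argument exists in h5py; the names are made up to show the shape of the suggestion):

```python
import numpy as np
import h5py

# Hypothetical API, invented only to illustrate the suggestion above:
# temporarily override how an HDF5 class maps to a NumPy dtype while reading.
with h5py.type_mapping({h5py.h5t.BITFIELD: np.uint8}):
    data = dset[:]   # bitfield members would come back as uint8, everything else unchanged
```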
@takluyver I like your idea of using a context manager to specify the exact source and destination datatypes when casting. I'm running into this issue at the moment; all of the datasets I have are compound data types containing float64s and bools, where the bools are represented as bitfields. I originally thought that the changes in 2.10 (https://github.com/h5py/h5py/pull/821) might have solved my problem, but unfortunately the suggested method of reading tries to cast every data type within the dataset to the specified type, and since there is no defined conversion from float to uint8 or bool, it fails.
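For anyone hitting the same wall, the failure mode looks roughly like this (dataset names reused from the snippets above; the exact error text comes from HDF5's missing conversion path, so it may vary):

```python
import h5py

h5_file = h5py.File('BigDataFile.h5', 'r')
dset = h5_file.get('/Base/GroupA')

# Casting the *whole* compound dataset to uint8 asks HDF5 to convert every
# member, including the float64 ones, to uint8. There is no such conversion
# path, so the read raises rather than converting only the bitfield members.
with dset.astype('uint8'):
    data = dset[:]   # fails
```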
It looks like this issue most closely matches the problems I'm facing, so I thought I'd add a comment to throw some weight/support behind it. Can anybody advise whether it has made it onto the development backlog or been scheduled for a specific release?
To contribute to the design discussion: I suspect there wouldn't be a way to maintain injectivity in the conversion, since my understanding is that there isn't a unique numpy data type that maps to a bitfield. If any other data type is mapped to uint8 as well, there will be ambiguity, so converting back to the bitfield type wouldn't be possible. From my perspective, a default cast to either np.bool or np.uint8 would satisfy my requirements, though it sounds like there's a reasonable case to prefer uint8. As mentioned previously, though, the ability to specify exactly what cast I want to perform when reading, i.e. "I want all bitfields to be cast to uint8", would be good too; something with more granularity than "cast everything in this dataset to this datatype".