Cannot create dataset from another astype wrapped dataset
I'm running into an issue where I would like to upcast some integer data stored in an HDF5 file. I was hoping I could do this with something like:
file["dataset_int64"] = file["dataset_int32"].astype(np.int64)
But this throws a TypeError. Here's an example:
Example
import h5py
import numpy as np

f = h5py.File("test.h5", "w")
f["a"] = np.ones(100, dtype=np.int32)   # plain int32 dataset
f["b"] = f["a"].astype(np.int64)        # raises TypeError
Traceback:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-5-c57aa10ec2d9> in <module>
----> 1 f["b"] = f["a"].astype(np.int64)
/usr/local/lib/python3.8/site-packages/h5py/_hl/group.py in __setitem__(self, name, obj)
409
410 else:
--> 411 ds = self.create_dataset(None, data=obj)
412 h5o.link(ds.id, self.id, name, lcpl=lcpl)
413
/usr/local/lib/python3.8/site-packages/h5py/_hl/group.py in create_dataset(self, name, shape, dtype, data, **kwds)
146 group = self.require_group(parent_path)
147
--> 148 dsid = dataset.make_new_dset(group, shape, dtype, data, name, **kwds)
149 dset = dataset.Dataset(dsid)
150 return dset
/usr/local/lib/python3.8/site-packages/h5py/_hl/dataset.py in make_new_dset(parent, shape, dtype, data, name, chunks, compression, shuffle, fletcher32, maxshape, compression_opts, fillvalue, scaleoffset, track_times, external, track_order, dcpl, allow_unknown_filter)
87 else:
88 dtype = numpy.dtype(dtype)
---> 89 tid = h5t.py_create(dtype, logical=1)
90
91 # Legacy
h5py/h5t.pyx in h5py.h5t.py_create()
h5py/h5t.pyx in h5py.h5t.py_create()
h5py/h5t.pyx in h5py.h5t.py_create()
TypeError: Object dtype dtype('O') has no native HDF5 equivalent
Expected behaviour
I expected it to make a copy of the dataset, cast to the new type. This might be the wrong expectation, since it looks like group["b"] = group["a"] creates a link. If making a copy here would conflict with those semantics, I think a better error message would be helpful.
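For what it's worth, a quick check (file name hypothetical) confirms the hard-link behaviour: writes through one name are visible through the other.

import h5py
import numpy as np

with h5py.File("links.h5", "w") as f:
    f["a"] = np.ones(100, dtype=np.int32)
    f["b"] = f["a"]       # creates a hard link, not a copy
    f["a"][0] = 5
    print(f["b"][0])      # prints 5: both names refer to the same dataset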
Version info
h5py 3.1.0
HDF5 1.12.0
Python 3.8.5 (default, Jul 23 2020, 15:50:11)
[Clang 11.0.3 (clang-1103.0.32.62)]
sys.platform darwin
sys.maxsize 9223372036854775807
numpy 1.19.4
cython (built with) 0.29.21
numpy (built against) 1.17.5
HDF5 (built against) 1.12.0
This also seems to happen with h5py 2.10.
Comments: 6 (3 by maintainers)
I’m not aware of an HDF5 function to copy part of a dataset, or to copy data with a conversion. H5Ocopy copies an object, such as a dataset, entirely - this is exposed in h5py as Group.copy(). But if you want to select part of the data or change its type, I think you will need to write a loop that reads and writes a chunk at a time.
If your data is stored in chunked format, the new Dataset.iter_chunks() method could be useful for this.
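For the whole-object route, a minimal sketch (reusing the file and dataset names from the example above) would be:

import h5py

with h5py.File("test.h5", "a") as f:
    # Group.copy() wraps H5Ocopy: it duplicates the object as-is,
    # keeping the source dtype (no conversion, no partial copy).
    f.copy("a", "a_copy")
    print(f["a_copy"].dtype)  # int32, same as the source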
If your source dataset is chunked, I would try something like the sketch below. Writing to a dataset with a different dtype should do the conversion anyway, but if you want it to be explicit, you can use iter_chunks on the source dataset and then use ds.astype(np.float64)[chunk] (HDF5 does the conversion while reading) or ds[chunk].astype(np.float64) (NumPy does the conversion).
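The code block from that comment did not survive the scrape; a minimal sketch of the chunk-wise copy, adapted to the int32-to-int64 case from the issue (file name and chunk size are assumptions), might look like:

import h5py
import numpy as np

with h5py.File("convert.h5", "w") as f:
    # iter_chunks() only works on chunked datasets, so create one explicitly.
    src = f.create_dataset("a", data=np.ones(100, dtype=np.int32), chunks=(25,))
    dst = f.create_dataset("b", shape=src.shape, dtype=np.int64, chunks=src.chunks)
    for chunk in src.iter_chunks():
        # Writing int32 blocks into the int64 dataset converts on write;
        # src.astype(np.int64)[chunk] would instead convert on read.
        dst[chunk] = src[chunk]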
That's what I would expect - it is copying the object, not changing its contents. I'm stretching the filesystem analogy a bit, but cp foo.html bar.html doesn't inline the images that foo.html refers to.

I think it would be reasonable to ask the HDF Group (help@hdfgroup.org or https://forum.hdfgroup.org/) if there's a good way to copy the contents of a virtual dataset as a concrete dataset. I believe HDF5 itself could do this more efficiently than code going through its public APIs, because it can see the storage layout of the source data.