
Cannot create dataset from another astype wrapped dataset


I’m running into an issue where I would like to upcast some integer data stored in an HDF5 file. I was hoping I could do this with something like:

file["dataset_int64"] = file["dataset_int32"].astype(np.int64)

But this throws a TypeError. Here’s an example:

Example

import h5py
import numpy as np

f = h5py.File("test.h5", "w")
f["a"] = np.ones(100, dtype=np.int32)
f["b"] = f["a"].astype(np.int64)

Traceback:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-5-c57aa10ec2d9> in <module>
----> 1 f["b"] = f["a"].astype(np.int64)

/usr/local/lib/python3.8/site-packages/h5py/_hl/group.py in __setitem__(self, name, obj)
    409 
    410             else:
--> 411                 ds = self.create_dataset(None, data=obj)
    412                 h5o.link(ds.id, self.id, name, lcpl=lcpl)
    413 

/usr/local/lib/python3.8/site-packages/h5py/_hl/group.py in create_dataset(self, name, shape, dtype, data, **kwds)
    146                     group = self.require_group(parent_path)
    147 
--> 148             dsid = dataset.make_new_dset(group, shape, dtype, data, name, **kwds)
    149             dset = dataset.Dataset(dsid)
    150             return dset

/usr/local/lib/python3.8/site-packages/h5py/_hl/dataset.py in make_new_dset(parent, shape, dtype, data, name, chunks, compression, shuffle, fletcher32, maxshape, compression_opts, fillvalue, scaleoffset, track_times, external, track_order, dcpl, allow_unknown_filter)
     87         else:
     88             dtype = numpy.dtype(dtype)
---> 89         tid = h5t.py_create(dtype, logical=1)
     90 
     91     # Legacy

h5py/h5t.pyx in h5py.h5t.py_create()

h5py/h5t.pyx in h5py.h5t.py_create()

h5py/h5t.pyx in h5py.h5t.py_create()

TypeError: Object dtype dtype('O') has no native HDF5 equivalent

Expected behaviour

I expected it to make a copy of the dataset cast to a different type. This might be the wrong expectation, since it looks like group["b"] = group["a"] creates a link. If making a copy here would cause a problem with those semantics, I think a better error message would be helpful.

Version info

h5py    3.1.0
HDF5    1.12.0
Python  3.8.5 (default, Jul 23 2020, 15:50:11) 
[Clang 11.0.3 (clang-1103.0.32.62)]
sys.platform    darwin
sys.maxsize     9223372036854775807
numpy   1.19.4
cython (built with) 0.29.21
numpy (built against) 1.17.5
HDF5 (built against) 1.12.0

This also seems to happen in h5py 2.10
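For datasets small enough to fit in memory, one workaround (a sketch, not an official h5py API for this — the file and dataset names follow the example above) is to read through the astype wrapper first, which yields a plain NumPy array that h5py can store as a new dataset:

```python
import h5py
import numpy as np

with h5py.File("test.h5", "w") as f:
    f["a"] = np.ones(100, dtype=np.int32)
    # Reading through the astype wrapper produces a NumPy array,
    # which can then be assigned as a new dataset:
    f["b"] = f["a"].astype(np.int64)[:]
    # Equivalent: read first, then let NumPy do the cast:
    # f["b"] = f["a"][:].astype(np.int64)
    assert f["b"].dtype == np.int64
```

This sidesteps the error because `__setitem__` receives a concrete array rather than the astype wrapper object, which h5py cannot map to an HDF5 type.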

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

takluyver commented, Nov 24, 2020 (1 reaction)

I’m not aware of an HDF5 function to copy part of a dataset, or to copy data with a conversion. H5Ocopy copies an object, such as a dataset, entirely - this is exposed in h5py as Group.copy(). But if you want to select part of the data or change its type, I think you will need to write a loop that reads and writes a chunk at a time.

If your data is stored in chunked format, the new Dataset.iter_chunks() method could be useful for this.

takluyver commented, Nov 25, 2020 (0 reactions)

If your source dataset is chunked, I would try something like this:

dst_ds = f.create_dataset_like('dst', src_ds, dtype=np.int64)

for chunk in src_ds.iter_chunks():
    dst_ds[chunk] = src_ds[chunk]

Writing to a dataset with a different dtype should do the conversion anyway, but if you want it to be explicit, you can use iter_chunks on the source dataset, and then use ds.astype(np.int64)[chunk] (HDF5 does the conversion while reading), or ds[chunk].astype(np.int64) (NumPy does the conversion).
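Putting the pieces together, the chunk-by-chunk copy with an explicit conversion can be sketched as follows (the file name and dataset names here are illustrative, and the issue's int64 target is assumed):

```python
import h5py
import numpy as np

with h5py.File("convert.h5", "w") as f:
    # A chunked int32 source dataset, as in the issue:
    src_ds = f.create_dataset("src", data=np.arange(100, dtype=np.int32),
                              chunks=(10,))
    # Same shape/chunking as the source, but with the target dtype:
    dst_ds = f.create_dataset_like("dst", src_ds, dtype=np.int64)

    # iter_chunks yields selections covering one chunk at a time,
    # so only one chunk is in memory per iteration:
    for chunk in src_ds.iter_chunks():
        # HDF5 converts while reading:
        dst_ds[chunk] = src_ds.astype(np.int64)[chunk]
        # ...or let NumPy convert after reading:
        # dst_ds[chunk] = src_ds[chunk].astype(np.int64)

    assert dst_ds.dtype == np.int64
```

Because each iteration touches a single chunk, this keeps memory usage bounded regardless of the total dataset size.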

> Trying this out, f.copy(...) still creates a virtual dataset

That’s what I would expect - it is copying the object, not changing its contents. I’m stretching the filesystem analogy a bit, but cp foo.html bar.html doesn’t inline images which foo.html refers to.

I think it would be reasonable to ask HDF group (help@hdfgroup.org or on https://forum.hdfgroup.org/ ) if there’s a good way to copy the contents of a virtual dataset as a concrete dataset. I believe HDF5 can do this more efficiently than code using HDF5 APIs, because it can see the storage layout of the source data.
