
Geoh5py reading in s3

See original GitHub issue

geoh5 is an example of an HDF5-based file format:

https://geoh5py.readthedocs.io/en/stable/

“The geoh5py library has been created for the manipulation and storage of a wide range of geoscientific data (points, curve, surface, 2D and 3D grids) in geoh5 file format. Users will be able to directly leverage the powerful visualization capabilities of Geoscience ANALYST along with open-source code from the Python ecosystem.”

So I use it quite a bit.
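
A typical read looks something like the sketch below (the file path and object name are hypothetical):

from geoh5py.workspace import Workspace

# Open an existing geoh5 file (hypothetical path)
workspace = Workspace("my_project.geoh5")

# List the objects stored in the workspace
for obj in workspace.objects:
    print(obj.name, type(obj))

# Fetch a specific entity by name (hypothetical name); get_entity returns a list
points = workspace.get_entity("my_points")[0]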

In fact, I now have 200K similar files in a couple of company S3 buckets.

(screenshot omitted)

So last night I started wondering whether I could access them quickly straight from S3, as opposed to downloading each one, extracting what I want, and leaving several terabytes lying around.

I remembered the Pangeo NetCDF/HDF5 discussion, so I started having a look.

This would be pretty useful.

e.g. something like this? @rsignell-usgs

import kerchunk.hdf
import s3fs

s3 = s3fs.S3FileSystem(profile='appropriateprofile')

# The paths already carry the s3:// prefix, so no extra prepending is needed
urls = ['s3://bananasplits/100075.geoh5']

so = dict(
    anon=False, default_fill_cache=False, default_cache_type='first'
)

singles = []
for u in urls:
    with s3.open(u, **so) as inf:
        h5chunks = kerchunk.hdf.SingleHdf5ToZarr(inf, u, inline_threshold=100)
        singles.append(h5chunks.translate())
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_24276/251853116.py in <module>
     13     with s3.open(u, **so) as inf:
     14         h5chunks = kerchunk.hdf.SingleHdf5ToZarr(inf, u, inline_threshold=100)
---> 15         singles.append(h5chunks.translate())

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\kerchunk\hdf.py in translate(self)
     71         lggr.debug('Translation begins')
     72         self._transfer_attrs(self._h5f, self._zroot)
---> 73         self._h5f.visititems(self._translator)
     74         if self.inline > 0:
     75             self._do_inline(self.inline)

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\h5py\_hl\group.py in visititems(self, func)
    611                 name = self._d(name)
    612                 return func(name, self[name])
--> 613             return h5o.visit(self.id, proxy)
    614 
    615     @with_phil

h5py\_objects.pyx in h5py._objects.with_phil.wrapper()

h5py\_objects.pyx in h5py._objects.with_phil.wrapper()

h5py\h5o.pyx in h5py.h5o.visit()

h5py\h5o.pyx in h5py.h5o.cb_obj_simple()

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\h5py\_hl\group.py in proxy(name)
    610                 """ Use the text name of the object, not bytes """
    611                 name = self._d(name)
--> 612                 return func(name, self[name])
    613             return h5o.visit(self.id, proxy)
    614 

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\kerchunk\hdf.py in _translator(self, name, h5obj)
    188 
    189             # Create a Zarr array equivalent to this HDF5 dataset...
--> 190             za = self._zroot.create_dataset(h5obj.name, shape=h5obj.shape,
    191                                             dtype=h5obj.dtype,
    192                                             chunks=h5obj.chunks or False,

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\zarr\hierarchy.py in create_dataset(self, name, **kwargs)
    806         """
    807 
--> 808         return self._write_op(self._create_dataset_nosync, name, **kwargs)
    809 
    810     def _create_dataset_nosync(self, name, data=None, **kwargs):

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\zarr\hierarchy.py in _write_op(self, f, *args, **kwargs)
    659 
    660         with lock:
--> 661             return f(*args, **kwargs)
    662 
    663     def create_group(self, name, overwrite=False):

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\zarr\hierarchy.py in _create_dataset_nosync(self, name, data, **kwargs)
    818         # create array
    819         if data is None:
--> 820             a = create(store=self._store, path=path, chunk_store=self._chunk_store,
    821                        **kwargs)
    822 

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\zarr\creation.py in create(shape, chunks, dtype, compressor, fill_value, order, store, synchronizer, overwrite, path, chunk_store, filters, cache_metadata, cache_attrs, read_only, object_codec, dimension_separator, **kwargs)
    134 
    135     # initialize array metadata
--> 136     init_array(store, shape=shape, chunks=chunks, dtype=dtype, compressor=compressor,
    137                fill_value=fill_value, order=order, overwrite=overwrite, path=path,
    138                chunk_store=chunk_store, filters=filters, object_codec=object_codec,

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\zarr\storage.py in init_array(store, shape, chunks, dtype, compressor, fill_value, order, overwrite, path, chunk_store, filters, object_codec, dimension_separator)
    350     _require_parent_group(path, store=store, chunk_store=chunk_store, overwrite=overwrite)
    351 
--> 352     _init_array_metadata(store, shape=shape, chunks=chunks, dtype=dtype,
    353                          compressor=compressor, fill_value=fill_value,
    354                          order=order, overwrite=overwrite, path=path,

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\zarr\storage.py in _init_array_metadata(store, shape, chunks, dtype, compressor, fill_value, order, overwrite, path, chunk_store, filters, object_codec, dimension_separator)
    427             if not filters:
    428                 # there are no filters so we can be sure there is no object codec
--> 429                 raise ValueError('missing object_codec for object array')
    430             else:
    431                 # one of the filters may be an object codec, issue a warning rather

ValueError: missing object_codec for object array
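
For context, the ValueError at the bottom is raised by zarr rather than kerchunk: zarr refuses to create an object-dtype array unless an object_codec is supplied, and the geoh5 file likely contains variable-length string datasets, which h5py exposes as object dtype. A minimal zarr (v2-era) illustration of the same failure and its fix:

import numcodecs
import zarr

g = zarr.group()  # in-memory group

# This fails exactly like the traceback above:
# g.create_dataset("names", shape=(10,), dtype=object)
#   -> ValueError: missing object_codec for object array

# Supplying an object_codec (variable-length UTF-8 strings here) succeeds
g.create_dataset("names", shape=(10,), dtype=object,
                 object_codec=numcodecs.VLenUTF8())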

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 60 (60 by maintainers)

Top GitHub Comments

2 reactions
martindurant commented, Apr 1, 2022

I think your issue was in

fsspec.get_mapper("reference://", fo=out)

you needed

fsspec.get_mapper("reference://", fo=out, remote_protocol="s3")

and any S3 filesystem options in a dict passed as remote_options.
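
Putting that together, the corrected call would look something like this sketch (profile name taken from the snippet above; out is the dict returned by translate()):

import fsspec
import zarr

m = fsspec.get_mapper(
    "reference://",
    fo=out,
    remote_protocol="s3",
    remote_options={"profile": "appropriateprofile", "anon": False},
)
z = zarr.open(m)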

2 reactions
martindurant commented, Apr 1, 2022

I did:

import kerchunk.hdf
import fsspec
import zarr

f = open("10075/10007post.geoh5", "rb")
h = kerchunk.hdf.SingleHdf5ToZarr(f, url="10075/10007post.geoh5")
out = h.translate()
m = fsspec.get_mapper("reference://", fo=out)
z = zarr.open(m)
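
The reference dict can also be written out as JSON and reused later without re-scanning the HDF5 file; a sketch, assuming the same out as above and a hypothetical output name:

import json

import fsspec
import zarr

# Persist the references produced by translate()
with open("10007post.json", "w") as f:
    json.dump(out, f)

# Reopen later straight from the JSON references
m = fsspec.get_mapper("reference://", fo="10007post.json")
z = zarr.open(m)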

