
Geoh5py reading in s3

See original GitHub issue

geoh5 is an example of an HDF5-based file format:

https://geoh5py.readthedocs.io/en/stable/

“The geoh5py library has been created for the manipulation and storage of a wide range of geoscientific data (points, curve, surface, 2D and 3D grids) in geoh5 file format. Users will be able to directly leverage the powerful visualization capabilities of Geoscience ANALYST along with open-source code from the Python ecosystem.”

So I use it quite a bit.
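
A typical read looks something like the sketch below (the file path and object name are hypothetical):

from geoh5py.workspace import Workspace

# Open an existing geoh5 file (hypothetical path)
workspace = Workspace("my_project.geoh5")

# List the objects stored in the workspace
for obj in workspace.objects:
    print(obj.name, type(obj))

# Fetch a specific entity by name (hypothetical name); get_entity returns a list
points = workspace.get_entity("my_points")[0]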

In fact, I now have 200K similar files in a couple of company S3 buckets.

(screenshot omitted)

So last night I started wondering whether I could access them quickly straight from S3, as opposed to downloading each one, extracting what I want, and leaving several terabytes lying around.

I remembered the Pangeo NetCDF/HDF5 discussion, so I started having a look.

This would be pretty useful.

e.g. something like this? @rsignell-usgs

import kerchunk.hdf
import s3fs

s3 = s3fs.S3FileSystem(profile='appropriateprofile')

# The paths already carry the s3:// prefix, so no extra prepending is needed
urls = ['s3://bananasplits/100075.geoh5']

so = dict(
    anon=False, default_fill_cache=False, default_cache_type='first'
)

singles = []
for u in urls:
    with s3.open(u, **so) as inf:
        h5chunks = kerchunk.hdf.SingleHdf5ToZarr(inf, u, inline_threshold=100)
        singles.append(h5chunks.translate())
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_24276/251853116.py in <module>
     13     with s3.open(u, **so) as inf:
     14         h5chunks = kerchunk.hdf.SingleHdf5ToZarr(inf, u, inline_threshold=100)
---> 15         singles.append(h5chunks.translate())

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\kerchunk\hdf.py in translate(self)
     71         lggr.debug('Translation begins')
     72         self._transfer_attrs(self._h5f, self._zroot)
---> 73         self._h5f.visititems(self._translator)
     74         if self.inline > 0:
     75             self._do_inline(self.inline)

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\h5py\_hl\group.py in visititems(self, func)
    611                 name = self._d(name)
    612                 return func(name, self[name])
--> 613             return h5o.visit(self.id, proxy)
    614 
    615     @with_phil

h5py\_objects.pyx in h5py._objects.with_phil.wrapper()

h5py\_objects.pyx in h5py._objects.with_phil.wrapper()

h5py\h5o.pyx in h5py.h5o.visit()

h5py\h5o.pyx in h5py.h5o.cb_obj_simple()

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\h5py\_hl\group.py in proxy(name)
    610                 """ Use the text name of the object, not bytes """
    611                 name = self._d(name)
--> 612                 return func(name, self[name])
    613             return h5o.visit(self.id, proxy)
    614 

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\kerchunk\hdf.py in _translator(self, name, h5obj)
    188 
    189             # Create a Zarr array equivalent to this HDF5 dataset...
--> 190             za = self._zroot.create_dataset(h5obj.name, shape=h5obj.shape,
    191                                             dtype=h5obj.dtype,
    192                                             chunks=h5obj.chunks or False,

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\zarr\hierarchy.py in create_dataset(self, name, **kwargs)
    806         """
    807 
--> 808         return self._write_op(self._create_dataset_nosync, name, **kwargs)
    809 
    810     def _create_dataset_nosync(self, name, data=None, **kwargs):

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\zarr\hierarchy.py in _write_op(self, f, *args, **kwargs)
    659 
    660         with lock:
--> 661             return f(*args, **kwargs)
    662 
    663     def create_group(self, name, overwrite=False):

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\zarr\hierarchy.py in _create_dataset_nosync(self, name, data, **kwargs)
    818         # create array
    819         if data is None:
--> 820             a = create(store=self._store, path=path, chunk_store=self._chunk_store,
    821                        **kwargs)
    822 

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\zarr\creation.py in create(shape, chunks, dtype, compressor, fill_value, order, store, synchronizer, overwrite, path, chunk_store, filters, cache_metadata, cache_attrs, read_only, object_codec, dimension_separator, **kwargs)
    134 
    135     # initialize array metadata
--> 136     init_array(store, shape=shape, chunks=chunks, dtype=dtype, compressor=compressor,
    137                fill_value=fill_value, order=order, overwrite=overwrite, path=path,
    138                chunk_store=chunk_store, filters=filters, object_codec=object_codec,

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\zarr\storage.py in init_array(store, shape, chunks, dtype, compressor, fill_value, order, overwrite, path, chunk_store, filters, object_codec, dimension_separator)
    350     _require_parent_group(path, store=store, chunk_store=chunk_store, overwrite=overwrite)
    351 
--> 352     _init_array_metadata(store, shape=shape, chunks=chunks, dtype=dtype,
    353                          compressor=compressor, fill_value=fill_value,
    354                          order=order, overwrite=overwrite, path=path,

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\zarr\storage.py in _init_array_metadata(store, shape, chunks, dtype, compressor, fill_value, order, overwrite, path, chunk_store, filters, object_codec, dimension_separator)
    427             if not filters:
    428                 # there are no filters so we can be sure there is no object codec
--> 429                 raise ValueError('missing object_codec for object array')
    430             else:
    431                 # one of the filters may be an object codec, issue a warning rather

ValueError: missing object_codec for object array
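
For context, the ValueError at the bottom is raised by zarr rather than kerchunk: zarr refuses to create an object-dtype array unless an object_codec is supplied, and the geoh5 file likely contains variable-length string datasets, which h5py exposes as object dtype. A minimal zarr (v2-era) illustration of the same failure and its fix:

import numcodecs
import zarr

g = zarr.group()  # in-memory group

# This fails exactly like the traceback above:
# g.create_dataset("names", shape=(10,), dtype=object)
#   -> ValueError: missing object_codec for object array

# Supplying an object_codec (variable-length UTF-8 strings here) succeeds
g.create_dataset("names", shape=(10,), dtype=object,
                 object_codec=numcodecs.VLenUTF8())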

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 60 (60 by maintainers)

Top GitHub Comments

2 reactions
martindurant commented, Apr 1, 2022

I think your issue was in

fsspec.get_mapper("reference://", fo=out)

you needed

fsspec.get_mapper("reference://", fo=out, remote_protocol="s3")

and any S3 filesystem options in a dict passed as remote_options.
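
Putting that together, the corrected call would look something like this sketch (profile name taken from the snippet above; out is the dict returned by translate()):

import fsspec
import zarr

m = fsspec.get_mapper(
    "reference://",
    fo=out,
    remote_protocol="s3",
    remote_options={"profile": "appropriateprofile", "anon": False},
)
z = zarr.open(m)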

2 reactions
martindurant commented, Apr 1, 2022

I did:

import kerchunk.hdf
import fsspec
import zarr

f = open("10075/10007post.geoh5", "rb")
h = kerchunk.hdf.SingleHdf5ToZarr(f, url="10075/10007post.geoh5")
out = h.translate()
m = fsspec.get_mapper("reference://", fo=out)
z = zarr.open(m)
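
The reference dict can also be written out as JSON and reused later without re-scanning the HDF5 file; a sketch, assuming the same out as above and a hypothetical output name:

import json

import fsspec
import zarr

# Persist the references produced by translate()
with open("10007post.json", "w") as f:
    json.dump(out, f)

# Reopen later straight from the JSON references
m = fsspec.get_mapper("reference://", fo="10007post.json")
z = zarr.open(m)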

