
trouble loading netcdf4 files with xarray on s3


I’m working on allowing direct access to netcdf4/hdf5 file-like objects (https://github.com/pydata/xarray/pull/2782). This seems to be working fine with gcsfs, but not with s3fs (version 0.2 from conda-forge). Here is a gist with the relevant code and error traceback:

https://gist.github.com/scottyhq/304a3c4b4e198776b8d82fb3a9f300e3

and an abbreviated traceback here:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/Documents/GitHub/xarray/xarray/backends/file_manager.py in acquire(self, needs_lock)
    166             try:
--> 167                 file = self._cache[self._key]
    168             except KeyError:

~/Documents/GitHub/xarray/xarray/backends/lru_cache.py in __getitem__(self, key)
     40         with self._lock:
---> 41             value = self._cache[key]
     42             self._cache.move_to_end(key)

KeyError: [<function _open_h5netcdf_group at 0x11d8b0ae8>, (<S3File grfn-content-prod/S1-GUNW-A-R-137-tops-20181129_20181123-020010-43220N_41518N-PP-e2c7-v2_0_0.nc>,), 'r', (('group', '/science/grids/data'),)]

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
h5py/h5fd.pyx in h5py.h5fd.H5FD_fileobj_read()

~/miniconda3/envs/test_env/lib/python3.6/site-packages/s3fs/core.py in readinto(self, b)
   1498         data = self.read()
-> 1499         b[:len(data)] = data
   1500         return len(data)

~/miniconda3/envs/test_env/lib/python3.6/site-packages/h5py/h5fd.cpython-36m-darwin.so in View.MemoryView.memoryview.__setitem__()

~/miniconda3/envs/test_env/lib/python3.6/site-packages/h5py/h5fd.cpython-36m-darwin.so in View.MemoryView.memoryview.setitem_slice_assignment()

~/miniconda3/envs/test_env/lib/python3.6/site-packages/h5py/h5fd.cpython-36m-darwin.so in View.MemoryView.memoryview_copy_contents()

~/miniconda3/envs/test_env/lib/python3.6/site-packages/h5py/h5fd.cpython-36m-darwin.so in View.MemoryView._err_extents()

ValueError: got differing extents in dimension 0 (got 8 and 59941567)

The above exception was the direct cause of the following exception:

SystemError                               Traceback (most recent call last)
h5py/h5fd.pyx in h5py.h5fd.H5FD_fileobj_read()

~/miniconda3/envs/test_env/lib/python3.6/site-packages/s3fs/core.py in seek(self, loc, whence)
   1235         """
-> 1236         if not self.readable():
   1237             raise ValueError('Seek only available in read mode')

SystemError: PyEval_EvalFrameEx returned a result with an error set

Any guidance as to what might be going on here would be appreciated!
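For what it's worth, the ValueError in the traceback points at the proximate failure: `S3File.readinto` reads the full remaining data and slice-assigns it into the caller's buffer (`b[:len(data)] = data`), which cannot work when h5py hands it a fixed-size buffer smaller than what was read. A minimal sketch of that failure mode, with no S3 involved (the buffer sizes here are placeholders):

```python
# h5py's file-object driver passes a fixed-size buffer to readinto();
# s3fs 0.2 copies the *entire* remaining read into it via slice assignment.
buf = memoryview(bytearray(8))      # stand-in for h5py's small buffer
data = b"\x89HDF" + b"\x00" * 100   # stand-in for a much larger read

try:
    buf[:len(data)] = data          # mirrors s3fs: b[:len(data)] = data
except ValueError as err:
    # memoryview slice assignment requires matching lengths; h5py's
    # Cython memoryview reports the same mismatch as "got differing
    # extents in dimension 0"
    print("ValueError:", err)
```

A `readinto` implementation is expected to fill at most `len(b)` bytes and return the count, rather than assume the buffer can hold everything read.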

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 36 (19 by maintainers)

Top GitHub Comments

1 reaction
martindurant commented, Apr 23, 2019

(I suppose this is why you want to encode all the options required for smooth working of a particular dataset into a catalog…)

0 reactions
martindurant commented, May 13, 2019

I don’t know the internals of h5netcdf, but I would hope it’s a range. You could time reading a whole array versus reading a single value; but it will not be linear, due to the fixed costs of each connection and the metadata lookups. For a slice, it would depend on the exact layout and chunking. You may want to turn on s3fs debug logging.
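If it helps, s3fs's debug logging can be switched on through the standard library; this sketch assumes the logger is named "s3fs", matching the module logger in s3fs.core:

```python
import logging

# Route log records to stderr, then turn the s3fs logger up to DEBUG
# so individual requests (and the byte ranges fetched) should show up.
logging.basicConfig(level=logging.INFO)
logging.getLogger("s3fs").setLevel(logging.DEBUG)
```

With this in place before opening the file, timing a whole-array read versus a single-value read should reveal whether ranged requests are being issued.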

On May 12, 2019 8:35:48 PM EDT, Paul Branson notifications@github.com wrote:

> If I take a slice from a netcdf opened with s3fs+h5netcdf, is it doing some form of byte range request, or essentially downloading the entire file into a memory cache and then slicing?
>
> In which case, should we always chunk on a file basis when using this method?
>
> On Wed., 8 May 2019, 11:19 pm Martin Durant, notifications@github.com wrote:
>
>> Are the cache contents serialized, or is the cache cleared before pickling the file object?
>>
>> PS: the file-system is serialised in this process, including directory listings. This is good or bad - you avoid potentially slow lookups when opening the file, but the instance is bigger. I notice that gcsfs does not preserve the listings cache. gcsfs came later and is, in some ways, better designed (hence my attempt to consolidate such things into fsspec).

