
trouble loading netcdf4 files with xarray on s3


I’m working on allowing direct access to netcdf4/hdf5 file-like objects (https://github.com/pydata/xarray/pull/2782). This seems to be working fine with gcsfs, but not with s3fs (version 0.2 from conda-forge). Here is a gist with the relevant code and error traceback:

https://gist.github.com/scottyhq/304a3c4b4e198776b8d82fb3a9f300e3

and an abbreviated traceback here:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/Documents/GitHub/xarray/xarray/backends/file_manager.py in acquire(self, needs_lock)
    166             try:
--> 167                 file = self._cache[self._key]
    168             except KeyError:

~/Documents/GitHub/xarray/xarray/backends/lru_cache.py in __getitem__(self, key)
     40         with self._lock:
---> 41             value = self._cache[key]
     42             self._cache.move_to_end(key)

KeyError: [<function _open_h5netcdf_group at 0x11d8b0ae8>, (<S3File grfn-content-prod/S1-GUNW-A-R-137-tops-20181129_20181123-020010-43220N_41518N-PP-e2c7-v2_0_0.nc>,), 'r', (('group', '/science/grids/data'),)]

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
h5py/h5fd.pyx in h5py.h5fd.H5FD_fileobj_read()

~/miniconda3/envs/test_env/lib/python3.6/site-packages/s3fs/core.py in readinto(self, b)
   1498         data = self.read()
-> 1499         b[:len(data)] = data
   1500         return len(data)

~/miniconda3/envs/test_env/lib/python3.6/site-packages/h5py/h5fd.cpython-36m-darwin.so in View.MemoryView.memoryview.__setitem__()

~/miniconda3/envs/test_env/lib/python3.6/site-packages/h5py/h5fd.cpython-36m-darwin.so in View.MemoryView.memoryview.setitem_slice_assignment()

~/miniconda3/envs/test_env/lib/python3.6/site-packages/h5py/h5fd.cpython-36m-darwin.so in View.MemoryView.memoryview_copy_contents()

~/miniconda3/envs/test_env/lib/python3.6/site-packages/h5py/h5fd.cpython-36m-darwin.so in View.MemoryView._err_extents()

ValueError: got differing extents in dimension 0 (got 8 and 59941567)

The above exception was the direct cause of the following exception:

SystemError                               Traceback (most recent call last)
h5py/h5fd.pyx in h5py.h5fd.H5FD_fileobj_read()

~/miniconda3/envs/test_env/lib/python3.6/site-packages/s3fs/core.py in seek(self, loc, whence)
   1235         """
-> 1236         if not self.readable():
   1237             raise ValueError('Seek only available in read mode')

SystemError: PyEval_EvalFrameEx returned a result with an error set

Any guidance as to what might be going on here would be appreciated!
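For what it's worth, the ValueError in the traceback points at the proximate failure: `S3File.readinto` reads the full remaining data and slice-assigns it into the caller's buffer (`b[:len(data)] = data`), which cannot work when h5py hands it a fixed-size buffer smaller than what was read. A minimal sketch of that failure mode, with no S3 involved (the buffer sizes here are placeholders):

```python
# h5py's file-object driver passes a fixed-size buffer to readinto();
# s3fs 0.2 copies the *entire* remaining read into it via slice assignment.
buf = memoryview(bytearray(8))      # stand-in for h5py's small buffer
data = b"\x89HDF" + b"\x00" * 100   # stand-in for a much larger read

try:
    buf[:len(data)] = data          # mirrors s3fs: b[:len(data)] = data
except ValueError as err:
    # memoryview slice assignment requires matching lengths; h5py's
    # Cython memoryview reports the same mismatch as "got differing
    # extents in dimension 0"
    print("ValueError:", err)
```

A `readinto` implementation is expected to fill at most `len(b)` bytes and return the count, rather than assume the buffer can hold everything read.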

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 36 (19 by maintainers)

Top GitHub Comments

1 reaction
martindurant commented, Apr 23, 2019

(I suppose this is why you want to encode all the options required for smooth working of a particular dataset into a catalog…)

0 reactions
martindurant commented, May 13, 2019

I don’t know the internals of h5netcdf, but I would hope it’s a range. You could time reading a whole array versus reading a single value; but it will not be linear, due to the fixed costs of each connection and the metadata lookups. For a slice, it would depend on the exact layout and chunking. You may want to turn on s3fs debug logging.
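If it helps, s3fs's debug logging can be switched on through the standard library; this sketch assumes the logger is named "s3fs", matching the module logger in s3fs.core:

```python
import logging

# Route log records to stderr, then turn the s3fs logger up to DEBUG
# so individual requests (and the byte ranges fetched) should show up.
logging.basicConfig(level=logging.INFO)
logging.getLogger("s3fs").setLevel(logging.DEBUG)
```

With this in place before opening the file, timing a whole-array read versus a single-value read should reveal whether ranged requests are being issued.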

On May 12, 2019 8:35:48 PM EDT, Paul Branson notifications@github.com wrote:

> If I take a slice from a netcdf opened with s3fs+h5netcdf, is it doing some form of byte range request, or essentially downloading the entire file into a memory cache and then slicing?
>
> In which case, should we always chunk on a file basis when using this method?
>
> On Wed., 8 May 2019, 11:19 pm Martin Durant, notifications@github.com wrote:
>
>> Are the cache contents serialized, or is the cache cleared before pickling the file object?
>>
>> PS: the file-system is serialised in this process, including directory listings. This is good or bad - you avoid potentially slow lookups when opening the file, but the instance is bigger. I notice that gcsfs does not preserve the listings cache. gcsfs came later and is, in some ways, better designed (hence my attempt to consolidate such things into fsspec).

