open_mfdataset: support for multiple zarr datasets

I am running calculations on a remote Dask cluster. Some of the data is only available on the workers, not on the client. It is already possible to have an xarray dataset that “points” to a remote NetCDF data collection by using the parallel option with xarray.open_mfdataset() like this:

from dask.distributed import Client
import xarray as xr

client = Client('<dask_scheduler_ip>:<port>')
ds = xr.open_mfdataset(remote_nc_file_paths, combine='by_coords', parallel=True)

xarray then uses dask.delayed, so that, for example, the following simple mean calculation is distributed among the workers and the result is returned to the client:

ds['Temp'].mean().compute()

Unfortunately, I cannot do the same thing with zarr, because open_mfdataset() does not support it, and open_zarr() does not have an option to use dask.delayed. Would it be possible to add dask.delayed support to the zarr backend? Or perhaps I am missing something, and there is a better way to work with zarr data on a remote Dask cluster?

Output of xr.show_versions():

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.7 |Anaconda custom (64-bit)| (default, Oct 23 2018, 19:16:44) 
[GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-862.2.3.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
libhdf5: 1.10.4
libnetcdf: 4.6.3

xarray: 0.14.1
pandas: 0.25.3
numpy: 1.17.3
scipy: 1.4.1
netCDF4: 1.5.3
pydap: None
h5netcdf: None
h5py: 2.8.0
Nio: None
zarr: 2.3.2
cftime: 1.0.4.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.2.1
dask: 2.9.1
distributed: 2.9.1
matplotlib: 3.1.2
cartopy: None
seaborn: 0.9.0
numbagg: None
setuptools: 40.4.3
pip: 18.1
conda: 4.8.0
pytest: 3.8.2
IPython: 7.0.1
sphinx: 1.8.1

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 14 (7 by maintainers)

Top GitHub Comments

1 reaction
rabernat commented, Jan 13, 2020

It would be wonderful if we could translate this complex xarray issue into a minimally simple zarr issue. Then the zarr devs can decide whether this use case is compatible with the zarr spec or not.

1 reaction
dmedv commented, Jan 13, 2020

@jhamman I did already confirm it with a zarr-only test, pickling and unpickling a zarr group object. I get the same error as with an xarray dataset: ValueError: group not found at path ''

Not sure if we can call it a bug, though. According to the storage specification (https://zarr.readthedocs.io/en/stable/spec/v2.html#storage), for a group to exist, a .zgroup key must exist under the corresponding logical path. So in the case of DirectoryStore, it is natural to check whether a .zgroup file exists at group object creation time.
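
To make the check described above concrete, here is a minimal standard-library sketch of why a group fails to open on a machine that cannot see the data. It is not zarr's actual implementation: the helper names group_exists, create_group and open_group are hypothetical, and the only details taken from the v2 storage spec are the .zgroup key and its {"zarr_format": 2} content.

```python
import json
import os
import tempfile


def group_exists(store_root, path=""):
    """Return True if a .zgroup file is present under the logical path.

    Hypothetical helper: mimics the existence check a directory-backed
    store could perform, it is not zarr's real code.
    """
    return os.path.isfile(os.path.join(store_root, path, ".zgroup"))


def create_group(store_root, path=""):
    """Write the v2 group metadata document at the given logical path."""
    group_dir = os.path.join(store_root, path)
    os.makedirs(group_dir, exist_ok=True)
    with open(os.path.join(group_dir, ".zgroup"), "w") as f:
        json.dump({"zarr_format": 2}, f)


def open_group(store_root, path=""):
    """Open a group eagerly: fail right away if .zgroup is not visible."""
    if not group_exists(store_root, path):
        raise ValueError(f"group not found at path {path!r}")
    return {"store": store_root, "path": path}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as worker_disk:
        # The "worker" can see the data, so the group opens.
        create_group(worker_disk)
        print(open_group(worker_disk))

        # The "client" looks at a location where the data is absent.
        client_view = os.path.join(worker_disk, "not-mounted-here")
        try:
            open_group(client_view)
        except ValueError as err:
            print(err)
```

Because the existence check runs eagerly when the object is constructed, rebuilding such an object on a machine where the directory is not mounted fails before any data is read, which is consistent with the error reported when unpickling on the client.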
