open_mfdataset: support for multiple zarr datasets
I am running calculations on a remote Dask cluster. Some of the data is only available on the workers, not on the client. It is already possible to have an xarray dataset that "points" to a remote NetCDF data collection by passing the `parallel` option to `xarray.open_mfdataset()`, like this:
```python
from dask.distributed import Client
import xarray as xr

client = Client('<dask_scheduler_ip>:<port>')
ds = xr.open_mfdataset(remote_nc_file_paths, combine='by_coords', parallel=True)
```
xarray then wraps the file openings in `dask.delayed`, and, for example, the following simple mean calculation is distributed across the workers, with the result returned to the client:

```python
ds['Temp'].mean().compute()
```
Unfortunately, I cannot do the same thing with zarr: `open_mfdataset()` does not support it, and `open_zarr()` has no option to use `dask.delayed`. Would it be possible to add `dask.delayed` support to the zarr backend? Or am I missing something, and there is a better way to work with zarr data on a remote Dask cluster?
Output of `xr.show_versions()`:
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.7 |Anaconda custom (64-bit)| (default, Oct 23 2018, 19:16:44)
[GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-862.2.3.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
libhdf5: 1.10.4
libnetcdf: 4.6.3
xarray: 0.14.1
pandas: 0.25.3
numpy: 1.17.3
scipy: 1.4.1
netCDF4: 1.5.3
pydap: None
h5netcdf: None
h5py: 2.8.0
Nio: None
zarr: 2.3.2
cftime: 1.0.4.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.2.1
dask: 2.9.1
distributed: 2.9.1
matplotlib: 3.1.2
cartopy: None
seaborn: 0.9.0
numbagg: None
setuptools: 40.4.3
pip: 18.1
conda: 4.8.0
pytest: 3.8.2
IPython: 7.0.1
sphinx: 1.8.1
Issue Analytics
- Created 4 years ago
- Comments: 14 (7 by maintainers)
Top GitHub Comments
It would be wonderful if we could translate this complex xarray issue into a minimally simple zarr issue. Then the zarr devs can decide whether this use case is compatible with the zarr spec or not.
@jhamman I already confirmed it with a zarr-only test, pickling and unpickling a zarr group object. I get the same error as with an xarray dataset:

```
ValueError: group not found at path ''
```
Not sure we can call it a bug, though. According to the storage specification (https://zarr.readthedocs.io/en/stable/spec/v2.html#storage), for a group to exist a `.zgroup` key must exist under the corresponding logical path, so in the case of `DirectoryStore` it is natural to check that a `.zgroup` file exists at group-object creation time.