open_mfdataset: support for multiple zarr datasets
I am running calculations on a remote Dask cluster. Some of the data is only available on the workers, not on the client. It is already possible to have an xarray dataset that "points" to a remote NetCDF data collection by passing the `parallel` option to `xarray.open_mfdataset()`, like this:
```python
from dask.distributed import Client
import xarray as xr

client = Client('<dask_scheduler_ip>:<port>')
ds = xr.open_mfdataset(remote_nc_file_paths, combine='by_coords', parallel=True)
```
xarray then wraps the file openings in `dask.delayed`, and, for example, the following simple mean calculation is distributed across the workers, with the result returned to the client:

```python
ds['Temp'].mean().compute()
```
Unfortunately, I cannot do the same thing with zarr: `open_mfdataset()` does not support it, and `open_zarr()` has no option to use `dask.delayed`. Would it be possible to add `dask.delayed` support to the zarr backend? Or am I missing something, and there is a better way to work with zarr data on a remote Dask cluster?
Output of `xr.show_versions()`:
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.7 |Anaconda custom (64-bit)| (default, Oct 23 2018, 19:16:44)
[GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-862.2.3.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
libhdf5: 1.10.4
libnetcdf: 4.6.3
xarray: 0.14.1
pandas: 0.25.3
numpy: 1.17.3
scipy: 1.4.1
netCDF4: 1.5.3
pydap: None
h5netcdf: None
h5py: 2.8.0
Nio: None
zarr: 2.3.2
cftime: 1.0.4.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.2.1
dask: 2.9.1
distributed: 2.9.1
matplotlib: 3.1.2
cartopy: None
seaborn: 0.9.0
numbagg: None
setuptools: 40.4.3
pip: 18.1
conda: 4.8.0
pytest: 3.8.2
IPython: 7.0.1
sphinx: 1.8.1
Issue Analytics
- Created 4 years ago
- Comments: 14 (7 by maintainers)
Top GitHub Comments
It would be wonderful if we could translate this complex xarray issue into a minimally simple zarr issue. Then the zarr devs can decide whether this use case is compatible with the zarr spec or not.
@jhamman I already confirmed it with a zarr-only test, pickling and unpickling a zarr group object. I get the same error as with an xarray dataset:

```
ValueError: group not found at path ''
```
Not sure we can call it a bug, though. According to the storage specification (https://zarr.readthedocs.io/en/stable/spec/v2.html#storage), for a group to exist a `.zgroup` key must exist under the corresponding logical path, so in the case of `DirectoryStore` it is natural to check that a `.zgroup` file exists at group-object creation time.