HDF5 error when working with compressed NetCDF files and the dask multiprocessing scheduler
See original GitHub issue
Code Sample, a copy-pastable example if possible
import xarray as xr
import numpy as np
import dask.multiprocessing
# Generate dummy data and build xarray dataset
mat = np.random.rand(10, 90, 90)
ds = xr.Dataset(data_vars={'foo': (('time', 'x', 'y'), mat)})
# Write dataset to netcdf without compression
ds.to_netcdf('dummy_data_3d.nc')
# Write with zlib compression
ds.to_netcdf('dummy_data_3d_with_compression.nc',
             encoding={'foo': {'zlib': True}})
# Write data as int16 with scale factor applied
ds.to_netcdf('dummy_data_3d_with_scale_factor.nc',
             encoding={'foo': {'dtype': 'int16',
                               'scale_factor': 0.01,
                               '_FillValue': -9999}})
# Load data from netCDF files
ds_vanilla = xr.open_dataset('dummy_data_3d.nc', chunks={'time': 1})
ds_scaled = xr.open_dataset('dummy_data_3d_with_scale_factor.nc', chunks={'time': 1})
ds_compressed = xr.open_dataset('dummy_data_3d_with_compression.nc', chunks={'time': 1})
# Do computation using dask's multiprocessing scheduler
foo = ds_vanilla.foo.mean(dim=['x', 'y']).compute(get=dask.multiprocessing.get)
foo = ds_scaled.foo.mean(dim=['x', 'y']).compute(get=dask.multiprocessing.get)
foo = ds_compressed.foo.mean(dim=['x', 'y']).compute(get=dask.multiprocessing.get)
# The last line fails
Problem description
If NetCDF files are compressed (which is often the case) and opened with chunking enabled to use them with dask, computations using the multiprocessing scheduler fail. The above code shows this in a short example. The last line fails with a long HDF5 error log:
HDF5-DIAG: Error detected in HDF5 (1.10.1) thread 140736213758912:
#000: H5Dio.c line 171 in H5Dread(): can't read data
major: Dataset
minor: Read failed
#001: H5Dio.c line 544 in H5D__read(): can't read data
major: Dataset
minor: Read failed
#002: H5Dchunk.c line 2022 in H5D__chunk_read(): error looking up chunk address
major: Dataset
minor: Can't get value
#003: H5Dchunk.c line 2768 in H5D__chunk_lookup(): can't query chunk address
major: Dataset
minor: Can't get value
#004: H5Dbtree.c line 1047 in H5D__btree_idx_get_addr(): can't get chunk info
major: Dataset
minor: Can't get value
#005: H5B.c line 341 in H5B_find(): unable to load B-tree node
major: B-Tree node
minor: Unable to protect metadata
#006: H5AC.c line 1763 in H5AC_protect(): H5C_protect() failed
major: Object cache
minor: Unable to protect metadata
#007: H5C.c line 2561 in H5C_protect(): can't load entry
major: Object cache
minor: Unable to load metadata into cache
#008: H5C.c line 6877 in H5C_load_entry(): Can't deserialize image
major: Object cache
minor: Unable to load metadata into cache
#009: H5Bcache.c line 181 in H5B__cache_deserialize(): wrong B-tree signature
major: B-Tree node
minor: Bad value
Traceback (most recent call last):
File "hdf5_bug_minimal_working_example.py", line 27, in <module>
foo = ds_compressed.foo.mean(dim=['x', 'y']).compute(get=dask.multiprocessing.get)
File "/Users/chwala-c/miniconda2/lib/python2.7/site-packages/xarray/core/dataarray.py", line 658, in compute
return new.load(**kwargs)
File "/Users/chwala-c/miniconda2/lib/python2.7/site-packages/xarray/core/dataarray.py", line 632, in load
ds = self._to_temp_dataset().load(**kwargs)
File "/Users/chwala-c/miniconda2/lib/python2.7/site-packages/xarray/core/dataset.py", line 491, in load
evaluated_data = da.compute(*lazy_data.values(), **kwargs)
File "/Users/chwala-c/miniconda2/lib/python2.7/site-packages/dask/base.py", line 333, in compute
results = get(dsk, keys, **kwargs)
File "/Users/chwala-c/miniconda2/lib/python2.7/site-packages/dask/multiprocessing.py", line 177, in get
raise_exception=reraise, **kwargs)
File "/Users/chwala-c/miniconda2/lib/python2.7/site-packages/dask/local.py", line 521, in get_async
raise_exception(exc, tb)
File "/Users/chwala-c/miniconda2/lib/python2.7/site-packages/dask/local.py", line 290, in execute_task
result = _execute_task(task, data)
File "/Users/chwala-c/miniconda2/lib/python2.7/site-packages/dask/local.py", line 270, in _execute_task
args2 = [_execute_task(a, cache) for a in args]
File "/Users/chwala-c/miniconda2/lib/python2.7/site-packages/dask/local.py", line 270, in _execute_task
args2 = [_execute_task(a, cache) for a in args]
File "/Users/chwala-c/miniconda2/lib/python2.7/site-packages/dask/local.py", line 267, in _execute_task
return [_execute_task(a, cache) for a in arg]
File "/Users/chwala-c/miniconda2/lib/python2.7/site-packages/dask/local.py", line 271, in _execute_task
return func(*args2)
File "/Users/chwala-c/miniconda2/lib/python2.7/site-packages/dask/array/core.py", line 72, in getter
c = np.asarray(c)
File "/Users/chwala-c/miniconda2/lib/python2.7/site-packages/numpy/core/numeric.py", line 531, in asarray
return array(a, dtype, copy=False, order=order)
File "/Users/chwala-c/miniconda2/lib/python2.7/site-packages/xarray/core/indexing.py", line 538, in __array__
return np.asarray(self.array, dtype=dtype)
File "/Users/chwala-c/miniconda2/lib/python2.7/site-packages/numpy/core/numeric.py", line 531, in asarray
return array(a, dtype, copy=False, order=order)
File "/Users/chwala-c/miniconda2/lib/python2.7/site-packages/xarray/core/indexing.py", line 505, in __array__
return np.asarray(array[self.key], dtype=None)
File "/Users/chwala-c/miniconda2/lib/python2.7/site-packages/xarray/backends/netCDF4_.py", line 61, in __getitem__
data = getitem(self.get_array(), key)
File "netCDF4/_netCDF4.pyx", line 3961, in netCDF4._netCDF4.Variable.__getitem__
File "netCDF4/_netCDF4.pyx", line 4798, in netCDF4._netCDF4.Variable._get
File "netCDF4/_netCDF4.pyx", line 1638, in netCDF4._netCDF4._ensure_nc_success
RuntimeError: NetCDF: HDF error
A possible workaround, if the dataset fits into memory, is to use
ds = ds.persist()
I could split up my dataset to accomplish this, but the beauty of xarray and dask gets lost a little when doing this…
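For reference, a minimal sketch of that workaround, reusing the compressed file from the example above (the file name and chunking are taken from that example; this is not a definitive recipe):

import xarray as xr
import dask.multiprocessing

ds_compressed = xr.open_dataset('dummy_data_3d_with_compression.nc',
                                chunks={'time': 1})
# Materialize all chunks in memory first, so reads happen in the main
# process; the multiprocessing compute then no longer touches the HDF5 file
ds_compressed = ds_compressed.persist()
foo = ds_compressed.foo.mean(dim=['x', 'y']).compute(get=dask.multiprocessing.get)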
Output of xr.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.14.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: de_DE.UTF-8
LOCALE: None.None
xarray: 0.10.0
pandas: 0.21.0
numpy: 1.13.3
scipy: 1.0.0
netCDF4: 1.3.1
h5netcdf: 0.5.0
Nio: None
bottleneck: 1.2.1
cyordereddict: 1.0.0
dask: 0.16.0
matplotlib: 2.1.0
cartopy: None
seaborn: 0.8.1
setuptools: 36.7.2
pip: 9.0.1
conda: 4.3.29
pytest: 3.2.5
IPython: 5.5.0
sphinx: None
This may be a limitation of multiprocessing with netCDF4. Can you try using dask’s distributed scheduler? That might work better, even on a single machine.
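A minimal sketch of what that suggestion could look like (untested here; the default local Client is an assumption, not part of the original report):

from dask.distributed import Client
import xarray as xr

client = Client()  # starts a local cluster and becomes the default scheduler
ds = xr.open_dataset('dummy_data_3d_with_compression.nc', chunks={'time': 1})
# Worker processes should re-open the file themselves instead of inheriting
# a forked HDF5 handle from the parent process
foo = ds.foo.mean(dim=['x', 'y']).compute()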
Any update on this? I get the HDF error with both the multiprocessing and the distributed scheduler.