H5py and Dask Distributed
See original GitHub issue
When I use Client to map a function over a Dask array made from an HDF5 dataset, the following error appears:
TypeError: can't pickle _thread._local objects
Here is a simplified version of what I am trying to do:
import h5py
import numpy as np
import dask.array as da
from dask.distributed import Client

h5_f = h5py.File(h5_path, mode='r+')
client = Client()

# a small 2-D h5py dataset wrapped as a Dask array
arr = np.arange(100).reshape((10, 10))
dset = h5_f.create_dataset("MyDataset", data=arr)
y = da.from_array(dset, chunks='auto')

# some function
def inc(x):
    return x + 1

# the client maps the function, inc(), over the dataset, y
# this is where the error appears
L = client.map(inc, y)

# results
results = client.gather(L)
After some testing, I believe the issue lies with the HDF5 dataset backing the lazy Dask array, which apparently cannot be pickled when it is passed through the map() function.
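A quick way to confirm this (a minimal sketch, reusing the dset object created in the snippet above and nothing Dask-specific) is to try pickling the h5py dataset directly:

import pickle

try:
    pickle.dumps(dset)   # dset is the h5py Dataset from the snippet above; it holds an open file handle
except TypeError as e:
    print(e)             # exact message depends on the h5py version, e.g. "can't pickle _thread._local objects"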
I am trying to integrate Dask into the pyUSID Python package, which is built on h5py for spectroscopy and imaging computations, so I need to use Dask together with HDF5.
I am using Python 3.7.3 on a MacBook Air with a 1.8 GHz Intel Core i7 (4-core) processor and 4 GB of RAM.
Here is the traceback:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/anaconda3/lib/python3.7/site-packages/distributed/worker.py in dumps_function(func)
2728 try:
-> 2729 result = cache[func]
2730 except KeyError:
KeyError: <bound method SignalFilter._unit_computation of <dask_signal_filter.SignalFilter object at 0xa1579d128>>
During handling of the above exception, another exception occurred:
TypeError Traceback (most recent call last)
/anaconda3/lib/python3.7/site-packages/distributed/protocol/pickle.py in dumps(x)
37 try:
---> 38 result = pickle.dumps(x, protocol=pickle.HIGHEST_PROTOCOL)
39 if len(result) < 1000:
TypeError: can't pickle _thread._local objects
During handling of the above exception, another exception occurred:
TypeError Traceback (most recent call last)
<ipython-input-10-b67304d97a62> in <module>
10 #L = client.map(sig_filt._unit_computation, sig_filt.data)
11 #L
---> 12 h5_filt_grp = sig_filt.compute(override=True)
13 #sig_filt.data
~/Downloads/daskUSID/signal_filter/dask_process.py in compute(self, override, *args, **kwargs)
414 print('Scheduler info: {}'.format(client.scheduler_info()))
415
--> 416 L = client.map(self._unit_computation, self.data, *args, **kwargs)
417 if self.verbose:
418 progress(L)
/anaconda3/lib/python3.7/site-packages/distributed/client.py in map(self, func, *iterables, **kwargs)
1437 user_priority=user_priority,
1438 fifo_timeout=fifo_timeout,
-> 1439 actors=actor)
1440 logger.debug("map(%s, ...)", funcname(func))
1441
/anaconda3/lib/python3.7/site-packages/distributed/client.py in _graph_to_futures(self, dsk, keys, restrictions, loose_restrictions, priority, user_priority, resources, retries, fifo_timeout, actors)
2259
2260 self._send_to_scheduler({'op': 'update-graph',
-> 2261 'tasks': valmap(dumps_task, dsk3),
2262 'dependencies': dependencies,
2263 'keys': list(flatkeys),
/anaconda3/lib/python3.7/site-packages/cytoolz/dicttoolz.pyx in cytoolz.dicttoolz.valmap()
/anaconda3/lib/python3.7/site-packages/cytoolz/dicttoolz.pyx in cytoolz.dicttoolz.valmap()
/anaconda3/lib/python3.7/site-packages/distributed/worker.py in dumps_task(task)
2765 return d
2766 elif not any(map(_maybe_complex, task[1:])):
-> 2767 return {'function': dumps_function(task[0]),
2768 'args': warn_dumps(task[1:])}
2769 return to_serialize(task)
/anaconda3/lib/python3.7/site-packages/distributed/worker.py in dumps_function(func)
2729 result = cache[func]
2730 except KeyError:
-> 2731 result = pickle.dumps(func)
2732 if len(result) < 100000:
2733 cache[func] = result
/anaconda3/lib/python3.7/site-packages/distributed/protocol/pickle.py in dumps(x)
49 except Exception:
50 try:
---> 51 return cloudpickle.dumps(x, protocol=pickle.HIGHEST_PROTOCOL)
52 except Exception as e:
53 logger.info("Failed to serialize %s. Exception: %s", x, e)
/anaconda3/lib/python3.7/site-packages/cloudpickle/cloudpickle.py in dumps(obj, protocol)
950 try:
951 cp = CloudPickler(file, protocol=protocol)
--> 952 cp.dump(obj)
953 return file.getvalue()
954 finally:
/anaconda3/lib/python3.7/site-packages/cloudpickle/cloudpickle.py in dump(self, obj)
265 self.inject_addons()
266 try:
--> 267 return Pickler.dump(self, obj)
268 except RuntimeError as e:
269 if 'recursion' in e.args[0]:
/anaconda3/lib/python3.7/pickle.py in dump(self, obj)
435 if self.proto >= 4:
436 self.framer.start_framing()
--> 437 self.save(obj)
438 self.write(STOP)
439 self.framer.end_framing()
/anaconda3/lib/python3.7/pickle.py in save(self, obj, save_persistent_id)
502 f = self.dispatch.get(t)
503 if f is not None:
--> 504 f(self, obj) # Call unbound method with explicit self
505 return
506
/anaconda3/lib/python3.7/site-packages/cloudpickle/cloudpickle.py in save_instancemethod(self, obj)
716 else:
717 if PY3: # pragma: no branch
--> 718 self.save_reduce(types.MethodType, (obj.__func__, obj.__self__), obj=obj)
719 else:
720 self.save_reduce(types.MethodType, (obj.__func__, obj.__self__, obj.__self__.__class__),
/anaconda3/lib/python3.7/pickle.py in save_reduce(self, func, args, state, listitems, dictitems, obj)
636 else:
637 save(func)
--> 638 save(args)
639 write(REDUCE)
640
/anaconda3/lib/python3.7/pickle.py in save(self, obj, save_persistent_id)
502 f = self.dispatch.get(t)
503 if f is not None:
--> 504 f(self, obj) # Call unbound method with explicit self
505 return
506
/anaconda3/lib/python3.7/pickle.py in save_tuple(self, obj)
769 if n <= 3 and self.proto >= 2:
770 for element in obj:
--> 771 save(element)
772 # Subtle. Same as in the big comment below.
773 if id(obj) in memo:
/anaconda3/lib/python3.7/pickle.py in save(self, obj, save_persistent_id)
547
548 # Save the reduce() output and finally memoize the object
--> 549 self.save_reduce(obj=obj, *rv)
550
551 def persistent_id(self, obj):
/anaconda3/lib/python3.7/pickle.py in save_reduce(self, func, args, state, listitems, dictitems, obj)
660
661 if state is not None:
--> 662 save(state)
663 write(BUILD)
664
/anaconda3/lib/python3.7/pickle.py in save(self, obj, save_persistent_id)
502 f = self.dispatch.get(t)
503 if f is not None:
--> 504 f(self, obj) # Call unbound method with explicit self
505 return
506
/anaconda3/lib/python3.7/pickle.py in save_dict(self, obj)
854
855 self.memoize(obj)
--> 856 self._batch_setitems(obj.items())
857
858 dispatch[dict] = save_dict
/anaconda3/lib/python3.7/pickle.py in _batch_setitems(self, items)
880 for k, v in tmp:
881 save(k)
--> 882 save(v)
883 write(SETITEMS)
884 elif n:
/anaconda3/lib/python3.7/pickle.py in save(self, obj, save_persistent_id)
547
548 # Save the reduce() output and finally memoize the object
--> 549 self.save_reduce(obj=obj, *rv)
550
551 def persistent_id(self, obj):
/anaconda3/lib/python3.7/pickle.py in save_reduce(self, func, args, state, listitems, dictitems, obj)
660
661 if state is not None:
--> 662 save(state)
663 write(BUILD)
664
/anaconda3/lib/python3.7/pickle.py in save(self, obj, save_persistent_id)
502 f = self.dispatch.get(t)
503 if f is not None:
--> 504 f(self, obj) # Call unbound method with explicit self
505 return
506
/anaconda3/lib/python3.7/pickle.py in save_dict(self, obj)
854
855 self.memoize(obj)
--> 856 self._batch_setitems(obj.items())
857
858 dispatch[dict] = save_dict
/anaconda3/lib/python3.7/pickle.py in _batch_setitems(self, items)
880 for k, v in tmp:
881 save(k)
--> 882 save(v)
883 write(SETITEMS)
884 elif n:
/anaconda3/lib/python3.7/pickle.py in save(self, obj, save_persistent_id)
522 reduce = getattr(obj, "__reduce_ex__", None)
523 if reduce is not None:
--> 524 rv = reduce(self.proto)
525 else:
526 reduce = getattr(obj, "__reduce__", None)
TypeError: can't pickle _thread._local objects
Top GitHub Comments
I still recommend the following approach: https://github.com/dask/distributed/issues/2787#issuecomment-507373886
In general, lots of things can make a function not serializable. Depending on an open file handle, as your example does, is certainly one reasonable cause. I think that making h5py objects serializable is outside the scope of Dask. There are a variety of workarounds: the example I showed, opening and closing the file every time inside a task, using a format that serializes more nicely, like Zarr, or using a project like Xarray. I'm going to go ahead and close this now. Again, good luck!
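For reference, here is a minimal sketch of the "open and close the file inside each task" workaround mentioned above. Only plain Python objects (the file path, the dataset name, and a row index) are sent to the workers, so nothing h5py-related ever has to be pickled. The helper inc_row is hypothetical, h5_path and "MyDataset" come from the example at the top, and the sketch assumes the file is not still held open for writing by another process (HDF5 file locking).

import h5py
from dask.distributed import Client

def inc_row(path, dset_name, i):
    # Open the file inside the task so that no h5py object is ever serialized.
    with h5py.File(path, mode='r') as f:
        return f[dset_name][i] + 1

client = Client()

# One future per row of the 10x10 dataset from the example above.
futures = client.map(inc_row, [h5_path] * 10, ["MyDataset"] * 10, range(10))
results = client.gather(futures)

The Zarr route mentioned above works along the same lines: copy the dataset once with dask.array.to_zarr and read it back with dask.array.from_zarr, whose chunks serialize without trouble.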
Okay, thank you.