
H5py and Dask Distributed

See original GitHub issue

When I use Client to map a function over a Dask array made from an HDF5 dataset, the following error appears:

TypeError: can't pickle _thread._local objects

Here is a simplified version of what I am trying to do:

import h5py
import numpy as np
import dask.array as da
from dask.distributed import Client

h5_path = 'my_file.h5'  # placeholder path to an existing HDF5 file
h5_f = h5py.File(h5_path, mode='r+')

client = Client()

# random 2D array written to an h5py dataset, then wrapped in a Dask array
arr = np.arange(100).reshape((10, 10))
dset = h5_f.create_dataset("MyDataset", data=arr)
y = da.from_array(dset, chunks='auto')

# some function
def inc(x):
    return x + 1

# client maps the function, inc(), over the dataset, y
# this is where the error appears
L = client.map(inc, y)

# results
results = client.gather(L)

After some testing, I believe the issue lies with the HDF5 dataset wrapped in the lazy Dask array, which apparently cannot be pickled when it is passed through map().
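One quick way to test that hypothesis is to try pickling the h5py dataset directly, outside of Dask. The snippet below is a minimal sketch that reuses the placeholder file and dataset names from the example above; if the hypothesis is right, it should fail with a pickling-related TypeError (the exact message may differ from the one client.map() produces):

import pickle
import h5py

# Quick check: try to pickle the h5py dataset on its own.
# "my_file.h5" and "MyDataset" are the placeholder names used above.
with h5py.File("my_file.h5", "r") as f:
    dset = f["MyDataset"]
    try:
        pickle.dumps(dset)
        print("dataset pickled fine")
    except TypeError as err:
        print("dataset is not picklable:", err)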

I am trying to integrate Dask into the pyUSID Python package, which is built on h5py, for spectroscopy and imaging computations. Therefore, I need Dask to work with HDF5 files.

I am using Python 3.7.3 on a MacBook Air with a 1.8 GHz Intel Core i7 (4-core) processor and 4 GB of RAM.

Here is the traceback:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/anaconda3/lib/python3.7/site-packages/distributed/worker.py in dumps_function(func)
   2728     try:
-> 2729         result = cache[func]
   2730     except KeyError:

KeyError: <bound method SignalFilter._unit_computation of <dask_signal_filter.SignalFilter object at 0xa1579d128>>

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
/anaconda3/lib/python3.7/site-packages/distributed/protocol/pickle.py in dumps(x)
     37     try:
---> 38         result = pickle.dumps(x, protocol=pickle.HIGHEST_PROTOCOL)
     39         if len(result) < 1000:

TypeError: can't pickle _thread._local objects

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
<ipython-input-10-b67304d97a62> in <module>
     10 #L = client.map(sig_filt._unit_computation, sig_filt.data)
     11 #L
---> 12 h5_filt_grp = sig_filt.compute(override=True)
     13 #sig_filt.data

~/Downloads/daskUSID/signal_filter/dask_process.py in compute(self, override, *args, **kwargs)
    414             print('Scheduler info: {}'.format(client.scheduler_info()))
    415 
--> 416         L = client.map(self._unit_computation, self.data, *args, **kwargs)
    417         if self.verbose:
    418             progress(L)

/anaconda3/lib/python3.7/site-packages/distributed/client.py in map(self, func, *iterables, **kwargs)
   1437                                          user_priority=user_priority,
   1438                                          fifo_timeout=fifo_timeout,
-> 1439                                          actors=actor)
   1440         logger.debug("map(%s, ...)", funcname(func))
   1441 

/anaconda3/lib/python3.7/site-packages/distributed/client.py in _graph_to_futures(self, dsk, keys, restrictions, loose_restrictions, priority, user_priority, resources, retries, fifo_timeout, actors)
   2259 
   2260             self._send_to_scheduler({'op': 'update-graph',
-> 2261                                      'tasks': valmap(dumps_task, dsk3),
   2262                                      'dependencies': dependencies,
   2263                                      'keys': list(flatkeys),

/anaconda3/lib/python3.7/site-packages/cytoolz/dicttoolz.pyx in cytoolz.dicttoolz.valmap()

/anaconda3/lib/python3.7/site-packages/cytoolz/dicttoolz.pyx in cytoolz.dicttoolz.valmap()

/anaconda3/lib/python3.7/site-packages/distributed/worker.py in dumps_task(task)
   2765             return d
   2766         elif not any(map(_maybe_complex, task[1:])):
-> 2767             return {'function': dumps_function(task[0]),
   2768                     'args': warn_dumps(task[1:])}
   2769     return to_serialize(task)

/anaconda3/lib/python3.7/site-packages/distributed/worker.py in dumps_function(func)
   2729         result = cache[func]
   2730     except KeyError:
-> 2731         result = pickle.dumps(func)
   2732         if len(result) < 100000:
   2733             cache[func] = result

/anaconda3/lib/python3.7/site-packages/distributed/protocol/pickle.py in dumps(x)
     49     except Exception:
     50         try:
---> 51             return cloudpickle.dumps(x, protocol=pickle.HIGHEST_PROTOCOL)
     52         except Exception as e:
     53             logger.info("Failed to serialize %s. Exception: %s", x, e)

/anaconda3/lib/python3.7/site-packages/cloudpickle/cloudpickle.py in dumps(obj, protocol)
    950     try:
    951         cp = CloudPickler(file, protocol=protocol)
--> 952         cp.dump(obj)
    953         return file.getvalue()
    954     finally:

/anaconda3/lib/python3.7/site-packages/cloudpickle/cloudpickle.py in dump(self, obj)
    265         self.inject_addons()
    266         try:
--> 267             return Pickler.dump(self, obj)
    268         except RuntimeError as e:
    269             if 'recursion' in e.args[0]:

/anaconda3/lib/python3.7/pickle.py in dump(self, obj)
    435         if self.proto >= 4:
    436             self.framer.start_framing()
--> 437         self.save(obj)
    438         self.write(STOP)
    439         self.framer.end_framing()

/anaconda3/lib/python3.7/pickle.py in save(self, obj, save_persistent_id)
    502         f = self.dispatch.get(t)
    503         if f is not None:
--> 504             f(self, obj) # Call unbound method with explicit self
    505             return
    506 

/anaconda3/lib/python3.7/site-packages/cloudpickle/cloudpickle.py in save_instancemethod(self, obj)
    716         else:
    717             if PY3:  # pragma: no branch
--> 718                 self.save_reduce(types.MethodType, (obj.__func__, obj.__self__), obj=obj)
    719             else:
    720                 self.save_reduce(types.MethodType, (obj.__func__, obj.__self__, obj.__self__.__class__),

/anaconda3/lib/python3.7/pickle.py in save_reduce(self, func, args, state, listitems, dictitems, obj)
    636         else:
    637             save(func)
--> 638             save(args)
    639             write(REDUCE)
    640 

/anaconda3/lib/python3.7/pickle.py in save(self, obj, save_persistent_id)
    502         f = self.dispatch.get(t)
    503         if f is not None:
--> 504             f(self, obj) # Call unbound method with explicit self
    505             return
    506 

/anaconda3/lib/python3.7/pickle.py in save_tuple(self, obj)
    769         if n <= 3 and self.proto >= 2:
    770             for element in obj:
--> 771                 save(element)
    772             # Subtle.  Same as in the big comment below.
    773             if id(obj) in memo:

/anaconda3/lib/python3.7/pickle.py in save(self, obj, save_persistent_id)
    547 
    548         # Save the reduce() output and finally memoize the object
--> 549         self.save_reduce(obj=obj, *rv)
    550 
    551     def persistent_id(self, obj):

/anaconda3/lib/python3.7/pickle.py in save_reduce(self, func, args, state, listitems, dictitems, obj)
    660 
    661         if state is not None:
--> 662             save(state)
    663             write(BUILD)
    664 

/anaconda3/lib/python3.7/pickle.py in save(self, obj, save_persistent_id)
    502         f = self.dispatch.get(t)
    503         if f is not None:
--> 504             f(self, obj) # Call unbound method with explicit self
    505             return
    506 

/anaconda3/lib/python3.7/pickle.py in save_dict(self, obj)
    854 
    855         self.memoize(obj)
--> 856         self._batch_setitems(obj.items())
    857 
    858     dispatch[dict] = save_dict

/anaconda3/lib/python3.7/pickle.py in _batch_setitems(self, items)
    880                 for k, v in tmp:
    881                     save(k)
--> 882                     save(v)
    883                 write(SETITEMS)
    884             elif n:

/anaconda3/lib/python3.7/pickle.py in save(self, obj, save_persistent_id)
    547 
    548         # Save the reduce() output and finally memoize the object
--> 549         self.save_reduce(obj=obj, *rv)
    550 
    551     def persistent_id(self, obj):

/anaconda3/lib/python3.7/pickle.py in save_reduce(self, func, args, state, listitems, dictitems, obj)
    660 
    661         if state is not None:
--> 662             save(state)
    663             write(BUILD)
    664 

/anaconda3/lib/python3.7/pickle.py in save(self, obj, save_persistent_id)
    502         f = self.dispatch.get(t)
    503         if f is not None:
--> 504             f(self, obj) # Call unbound method with explicit self
    505             return
    506 

/anaconda3/lib/python3.7/pickle.py in save_dict(self, obj)
    854 
    855         self.memoize(obj)
--> 856         self._batch_setitems(obj.items())
    857 
    858     dispatch[dict] = save_dict

/anaconda3/lib/python3.7/pickle.py in _batch_setitems(self, items)
    880                 for k, v in tmp:
    881                     save(k)
--> 882                     save(v)
    883                 write(SETITEMS)
    884             elif n:

/anaconda3/lib/python3.7/pickle.py in save(self, obj, save_persistent_id)
    522             reduce = getattr(obj, "__reduce_ex__", None)
    523             if reduce is not None:
--> 524                 rv = reduce(self.proto)
    525             else:
    526                 reduce = getattr(obj, "__reduce__", None)

TypeError: can't pickle _thread._local objects

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 12 (7 by maintainers)

Top GitHub Comments

2 reactions
mrocklin commented, Jul 1, 2019

I still recommend the following approach: https://github.com/dask/distributed/issues/2787#issuecomment-507373886

Interesting. I don’t know then. Perhaps work from my example (verify that it works first) and then change things towards yours until something breaks? Maybe that helps to narrow things down?

In general, lots of things can make a function not serializable. Certainly depending on an open file handle such as you have in your example would be one reasonable cause. I think that making h5py objects serializable is outside of the scope of Dask. There are a variety of workarounds such as the example I showed, opening and closing the file every time in a task, using a nicer format for serialization, like Zarr, or using a project like Xarray. I’m going to go ahead and close this now. Again, good luck!

0 reactions
emilyjcosta5 commented, Jul 1, 2019

Okay, thank you.
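
For reference, below is a minimal sketch of the "open and close the file inside each task" workaround mentioned in the last maintainer comment. The file path and dataset name are placeholders, and the delayed-task structure is only one way to apply the idea, not the exact approach from the linked example:

import h5py
import dask
import numpy as np

H5_PATH = "my_file.h5"      # placeholder path to an existing HDF5 file
DSET_NAME = "MyDataset"     # placeholder dataset name

@dask.delayed
def inc_block(path, name, row_slice):
    # The file is opened inside the task, so only the path, dataset name,
    # and slice are serialized -- never an open h5py handle.
    with h5py.File(path, "r") as f:
        return f[name][row_slice] + 1

# One task per row of the 10x10 dataset from the question above;
# each task reopens the file on whichever worker runs it.
tasks = [inc_block(H5_PATH, DSET_NAME, slice(i, i + 1)) for i in range(10)]
results = np.concatenate(dask.compute(*tasks))

With a dask.distributed Client created beforehand, these tasks run on the cluster; without one, they fall back to the local scheduler.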


