Scheduler hangs randomly
Hey,
I’m facing problems using parallel_backend. For some reason, the scheduler hangs randomly after a few calls. I’m calling parallel_backend as follows:
```python
with parallel_backend('dask.distributed', scheduler_host='127.0.0.1:8786'):
    ret = Parallel()(fd(X, s, **kwds) for s in _gen_even_slices(len(X), n_jobs))
```
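For context on what is being dispatched here: `_gen_even_slices` splits `range(len(X))` into `n_jobs` contiguous, near-equal slices, one per task. Below is a standalone sketch of that helper (a reimplementation mirroring scikit-learn's `gen_even_slices`, not the reporter's exact code), so the fan-out shape is easy to see without a dask cluster:

```python
def gen_even_slices(n, n_packs):
    # Split range(n) into n_packs contiguous slices of near-equal size,
    # mirroring sklearn.utils.gen_even_slices: the first n % n_packs
    # slices get one extra element.
    start = 0
    for pack_num in range(n_packs):
        this_n = n // n_packs + (1 if pack_num < n % n_packs else 0)
        if this_n > 0:
            yield slice(start, start + this_n)
            start += this_n

# 10 items over 4 jobs -> chunk sizes 3, 3, 2, 2
print([(s.start, s.stop) for s in gen_even_slices(10, 4)])
# -> [(0, 3), (3, 6), (6, 8), (8, 10)]
```

Each of those slices becomes one `fd(X, s, **kwds)` task submitted through the backend, so with `n_jobs` slices and 14 workers the scheduler should never be oversubscribed by a single call.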
I have 2 8-core servers where I start dask-worker as follows:
```
dask-worker 10.1.0.4:8786 --nprocs 7 --nthreads 1 --no-bokeh
```
They are idle at the moment the scheduler starts hanging. When I cancel the python call after it hangs, the following stacktrace is printed:
```
Traceback (most recent call last):
  File "/home/christian/.conda/envs/nedtrain35/lib/python3.5/site-packages/joblib/parallel.py", line 684, in retrieve
^C
  File "/home/christian/.conda/envs/nedtrain35/lib/python3.5/site-packages/distributed/client.py", line 110, in result
    result = sync(self.client.loop, self._result, raiseit=False)
  File "/home/christian/.conda/envs/nedtrain35/lib/python3.5/site-packages/distributed/utils.py", line 161, in sync
    e.wait(1000000)
  File "/home/christian/.conda/envs/nedtrain35/lib/python3.5/threading.py", line 549, in wait
    signaled = self._cond.wait(timeout)
  File "/home/christian/.conda/envs/nedtrain35/lib/python3.5/threading.py", line 297, in wait
    gotit = waiter.acquire(True, timeout)
KeyboardInterrupt
```
The following versions are installed:
- dask 0.13.0 py35_0 conda-forge
- distributed 1.15.0 py35_0 conda-forge
- tornado 4.4.1 py35_0
Issue Analytics
- Created 7 years ago
- Comments: 11 (5 by maintainers)
Top GitHub Comments
It’s pretty hard to come up with a small reproducible example. Is there a log somewhere where I can see why a worker failed?
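On the log question: by default, `dask-worker` writes its log output to stderr on the machine where it runs, so one way to keep it for inspection (an assumption about the deployment above, not something stated in the issue) is to redirect it to a file when starting each worker:

```shell
# Same invocation as above, but keeping the worker's stderr
# (where distributed logs by default) in a file for inspection.
dask-worker 10.1.0.4:8786 --nprocs 7 --nthreads 1 --no-bokeh \
    > worker.log 2>&1
```

More recent distributed releases also expose worker logs programmatically via `Client.get_worker_logs()`, though that method may not exist in the 1.15.0 version installed here.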
@mrocklin Good to know that you are able to replicate on your end. This is one example of the error I have encountered. I have come across similar issues where the worker/scheduler hangs because of some error in the code. Hopefully the fix will be generic enough to catch other such issues.