Scheduler hangs randomly
Hey,
I’m facing problems using parallel_backend. For some reason, the scheduler hangs randomly after a few calls. I’m calling parallel_backend as follows:
```python
with parallel_backend('dask.distributed', scheduler_host='127.0.0.1:8786'):
    ret = Parallel()(fd(X, s, **kwds) for s in _gen_even_slices(len(X), n_jobs))
```
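For context on what is being dispatched here: `_gen_even_slices` splits `range(len(X))` into `n_jobs` contiguous, near-equal slices, one per task. Below is a standalone sketch of that helper (a reimplementation mirroring scikit-learn's `gen_even_slices`, not the reporter's exact code), so the fan-out shape is easy to see without a dask cluster:

```python
def gen_even_slices(n, n_packs):
    # Split range(n) into n_packs contiguous slices of near-equal size,
    # mirroring sklearn.utils.gen_even_slices: the first n % n_packs
    # slices get one extra element.
    start = 0
    for pack_num in range(n_packs):
        this_n = n // n_packs + (1 if pack_num < n % n_packs else 0)
        if this_n > 0:
            yield slice(start, start + this_n)
            start += this_n

# 10 items over 4 jobs -> chunk sizes 3, 3, 2, 2
print([(s.start, s.stop) for s in gen_even_slices(10, 4)])
# -> [(0, 3), (3, 6), (6, 8), (8, 10)]
```

Each of those slices becomes one `fd(X, s, **kwds)` task submitted through the backend, so with `n_jobs` slices and 14 workers the scheduler should never be oversubscribed by a single call.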
I have 2 8-core servers where I start dask-worker as follows:
```
dask-worker 10.1.0.4:8786 --nprocs 7 --nthreads 1 --no-bokeh
```
They are idle at the moment the scheduler starts hanging. When I cancel the python call after it hangs, the following stacktrace is printed:
```
Traceback (most recent call last):
  File "/home/christian/.conda/envs/nedtrain35/lib/python3.5/site-packages/joblib/parallel.py", line 684, in retrieve
^C
  File "/home/christian/.conda/envs/nedtrain35/lib/python3.5/site-packages/distributed/client.py", line 110, in result
    result = sync(self.client.loop, self._result, raiseit=False)
  File "/home/christian/.conda/envs/nedtrain35/lib/python3.5/site-packages/distributed/utils.py", line 161, in sync
    e.wait(1000000)
  File "/home/christian/.conda/envs/nedtrain35/lib/python3.5/threading.py", line 549, in wait
    signaled = self._cond.wait(timeout)
  File "/home/christian/.conda/envs/nedtrain35/lib/python3.5/threading.py", line 297, in wait
    gotit = waiter.acquire(True, timeout)
KeyboardInterrupt
```
The following versions are installed:
- dask 0.13.0 py35_0 conda-forge
- distributed 1.15.0 py35_0 conda-forge
- tornado 4.4.1 py35_0
Issue Analytics
- Created 7 years ago
- Comments: 11 (5 by maintainers)
Top GitHub Comments
It’s pretty hard to come up with a small reproducible example. Is there a log somewhere where I can see why a worker failed?
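On the log question: by default, `dask-worker` writes its log output to stderr on the machine where it runs, so one way to keep it for inspection (an assumption about the deployment above, not something stated in the issue) is to redirect it to a file when starting each worker:

```shell
# Same invocation as above, but keeping the worker's stderr
# (where distributed logs by default) in a file for inspection.
dask-worker 10.1.0.4:8786 --nprocs 7 --nthreads 1 --no-bokeh \
    > worker.log 2>&1
```

More recent distributed releases also expose worker logs programmatically via `Client.get_worker_logs()`, though that method may not exist in the 1.15.0 version installed here.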
@mrocklin Good to know that you are able to replicate on your end. This is one example of the error I have encountered. I have come across similar issues where the worker/scheduler hangs because of some error in the code. Hopefully the fix will be generic enough to catch other such issues.