Dask hanging on long computation when used as joblib backend
I have an embarrassingly parallel problem that I run under joblib like this:
from joblib import Parallel, delayed

parallel = Parallel(n_jobs=-1)
results = parallel([delayed(lengthy_function)(x, A, B, C) for x in some_list])
The parameters A, B, and C are the same for all runs and are pre-computed. A typical run of lengthy_function takes about 2 minutes to finish.
When using standard joblib, my code runs just fine. However, when I try to use a dask/distributed backend like this:
from joblib import Parallel, delayed, parallel_backend
import distributed.joblib  # older distributed releases need this import to register the backend

with parallel_backend('dask.distributed', scheduler_host='localhost:8786'):
    parallel = Parallel(n_jobs=-1)
    results = parallel([delayed(lengthy_function)(x, A, B, C) for x in some_list])
it just hangs without giving me any results. I see some tasks getting dispatched, but nothing ever comes back. In the console where I am running the dask-worker instances, I just get a warning message (repeated several times):
distributed.core - WARNING - Event loop was unresponsive for 1.82s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
Is there anything I can/should do to debug or fix this issue?
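One thing that might be worth trying, given the warning about moving large chunks of data: newer joblib releases let you pre-scatter large shared arguments to the workers so they are serialized once rather than once per task. The sketch below assumes a recent joblib with the 'dask' backend and dask.distributed installed; the data shapes, the stand-in lengthy_function, and the scheduler address are illustrative, not taken from the issue.

import numpy as np
from joblib import Parallel, delayed, parallel_backend
from distributed import Client

def lengthy_function(x, A, B, C):
    # Stand-in for the real two-minute computation.
    return x + A.sum() + B.sum() + C.sum()

A, B, C = (np.random.rand(1000, 1000) for _ in range(3))
some_list = range(10)

client = Client('localhost:8786')  # connect to the running scheduler

# Pre-scatter the large shared arguments so each worker receives them
# once, instead of re-serializing them with every task.
with parallel_backend('dask', scatter=[A, B, C]):
    results = Parallel(n_jobs=-1)(
        delayed(lengthy_function)(x, A, B, C) for x in some_list
    )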
Top GitHub Comments
I have managed to track down the issue a bit further. The job actually finishes with the distributed backend, but it takes much, much longer than a serial version of the code or the multiprocessing backend.
It seems the root of the problem is nested parallelization. I am using the distributed context and scattering my data around the outermost loop, but there are further calls to parallel contexts below it. By manually removing some of them from my own code, I managed to cut the main running time by about 30%, but it is still far slower than the multiprocessing joblib backend. There are still calls to parallel code in deeper parts of the stack (scikit-learn code).
I will keep digging.
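If the inner scikit-learn calls are the remaining source of nested parallelism, one common workaround (a sketch with an illustrative estimator and dataset, not code from this thread) is to pin the inner n_jobs to 1 so that only the outer joblib loop fans out:

from joblib import Parallel, delayed
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_train, y_train = make_classification(n_samples=500, random_state=0)

def lengthy_function(seed, X, y):
    # n_jobs=1 prevents the estimator from spawning its own parallel
    # workers inside the already-parallel outer joblib loop.
    model = RandomForestClassifier(n_estimators=50, n_jobs=1, random_state=seed)
    model.fit(X, y)
    return model.score(X, y)

# Only the outer loop is parallelized; inner scikit-learn calls stay serial.
results = Parallel(n_jobs=-1)(
    delayed(lengthy_function)(s, X_train, y_train) for s in range(20)
)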
We’ve been chatting with the Joblib devs about handling the nested parallelism case. It has caused frustration for other applications as well.
When digging into similar situations, I have often found the diagnostic dashboard to be of use: http://distributed.readthedocs.io/en/latest/web.html
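For anyone following along, a minimal sketch of reaching that dashboard from a client session (the ports are the distributed defaults and may differ on your setup; dashboard_link assumes a reasonably recent distributed release):

from distributed import Client

# Connect to the scheduler that the dask-worker instances report to.
client = Client('localhost:8786')

# With bokeh installed, the scheduler serves the diagnostic dashboard
# (port 8787 by default); recent releases expose its URL directly.
print(client.dashboard_link)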