
Dask hanging on long computation when used as joblib backend


I have an embarrassingly parallel problem that I run under joblib like this:


from joblib import Parallel, delayed

parallel = Parallel(n_jobs=-1)
results = parallel([delayed(lengthy_function)(x, A, B, C) for x in some_list])

The parameters A, B, C are the same for all runs and are already pre-computed. A typical run of lengthy_function takes about 2 minutes to finish.

When using standard joblib, my code runs just fine. However, when I try to use a dask/distributed backend like this:

import distributed.joblib  # registers the 'dask.distributed' backend with joblib (older distributed releases)
from joblib import Parallel, delayed, parallel_backend

with parallel_backend('dask.distributed', scheduler_host='localhost:8786'):
    parallel = Parallel(n_jobs=-1)
    results = parallel([delayed(lengthy_function)(x, A, B, C) for x in some_list])

it just hangs without giving me any results. I see some tasks getting dispatched, but nothing ever comes back. In the console where I am running the dask-worker instances, I just get this warning message (repeated several times):

distributed.core - WARNING - Event loop was unresponsive for 1.82s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.

Is there anything I can/should do to debug or fix this issue?
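
A quick way to narrow this down is to bypass joblib entirely and run a single call through the same scheduler. Below is a minimal sketch, assuming the dask-scheduler from the snippet above is still listening on localhost:8786 and that lengthy_function, A, B, C and some_list are defined as in the question:

from dask.distributed import Client

client = Client('localhost:8786')          # connect to the running dask-scheduler
print(client.scheduler_info()['workers'])  # confirm the workers are actually registered

# Submit one task outside of joblib: if this also stalls, the problem lies in
# dask/distributed (or in lengthy_function itself) rather than in the joblib backend.
future = client.submit(lengthy_function, some_list[0], A, B, C)
print(future.result(timeout=300))          # a single run should return in roughly 2 minutes

If the single task completes promptly, the slowdown is more likely in how the work is being dispatched through joblib (see the comments below).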

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 11 (5 by maintainers)

Top GitHub Comments

2 reactions
jlopezpena commented, Sep 8, 2017

I have managed to track down the issue a bit further. The job actually finishes with the distributed backend, but it takes much, much longer than either the serial version of the code or the multiprocessing backend.

It seems the root of the problem is nested parallelization. I am using the distributed context and scattering my data around the outermost loop, but there are further calls to parallel contexts below it. By manually removing some of them from my own code I managed to cut the main running time by about 30%, but it is still far slower than the multiprocessing joblib backend. There are still some parallel calls in deeper parts of the code (inside scikit-learn).

I will keep digging.
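
For the shared inputs, a commonly suggested pattern is to broadcast them to the workers once with Client.scatter and pass the resulting futures into each task, while disabling the inner scikit-learn parallelism (n_jobs=1) so that only the outer loop is parallelized. A minimal sketch, assuming lengthy_function accepts A, B and C as keyword arguments and the same scheduler address as above:

from dask.distributed import Client

client = Client('localhost:8786')

# Send the large pre-computed inputs to every worker once, instead of
# shipping a copy of them along with each individual task.
A_f, B_f, C_f = client.scatter([A, B, C], broadcast=True)

# Parallelize only the outer loop; the futures are resolved back into the
# real objects before lengthy_function runs on a worker.
futures = client.map(lengthy_function, some_list, A=A_f, B=B_f, C=C_f)
results = client.gather(futures)

Setting n_jobs=1 on any estimators inside lengthy_function keeps the workers from oversubscribing themselves with nested thread or process pools.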

1 reaction
mrocklin commented, Sep 8, 2017

We’ve been chatting with the Joblib devs about handling the nested parallelism case. It has caused frustration for other applications as well.

When digging into similar situations I have often found the diagnostic dashboard to be of use: http://distributed.readthedocs.io/en/latest/web.html
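
By default the scheduler serves that dashboard on port 8787, so with the setup from the question it should be reachable while the job runs; a minimal sketch, assuming the default ports:

import webbrowser

# The Bokeh diagnostic dashboard served by dask-scheduler: shows the task
# stream, progress bars and per-worker memory/CPU while the computation runs.
webbrowser.open("http://localhost:8787/status")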


