Dask hanging on long computation when used as joblib backend
I have an embarrassingly parallel problem that I run under joblib like this:
from joblib import Parallel, delayed

parallel = Parallel(n_jobs=-1)
results = parallel([delayed(lengthy_function)(x, A, B, C) for x in some_list])
The parameters A, B, and C are the same for all runs and are pre-computed. A typical run of lengthy_function takes about 2 minutes to finish.
When using standard joblib, my code runs just fine. However, when I try to use a dask/distributed backend like this:
from joblib import Parallel, delayed, parallel_backend
import distributed.joblib  # older distributed releases need this import to register the backend

with parallel_backend('dask.distributed', scheduler_host='localhost:8786'):
    parallel = Parallel(n_jobs=-1)
    results = parallel([delayed(lengthy_function)(x, A, B, C) for x in some_list])
it just hangs without giving me any results. I see some tasks getting dispatched, but nothing ever comes back. In the console where I am running the dask-worker instances, I just get a warning message (repeated several times):
distributed.core - WARNING - Event loop was unresponsive for 1.82s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
Is there anything I can/should do to debug or fix this issue?
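One thing that might be worth trying, given the warning about moving large chunks of data: newer joblib releases let you pre-scatter large shared arguments to the workers so they are serialized once rather than once per task. The sketch below assumes a recent joblib with the 'dask' backend and dask.distributed installed; the data shapes, the stand-in lengthy_function, and the scheduler address are illustrative, not taken from the issue.

import numpy as np
from joblib import Parallel, delayed, parallel_backend
from distributed import Client

def lengthy_function(x, A, B, C):
    # Stand-in for the real two-minute computation.
    return x + A.sum() + B.sum() + C.sum()

A, B, C = (np.random.rand(1000, 1000) for _ in range(3))
some_list = range(10)

client = Client('localhost:8786')  # connect to the running scheduler

# Pre-scatter the large shared arguments so each worker receives them
# once, instead of re-serializing them with every task.
with parallel_backend('dask', scatter=[A, B, C]):
    results = Parallel(n_jobs=-1)(
        delayed(lengthy_function)(x, A, B, C) for x in some_list
    )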
Top GitHub Comments
I have managed to track down the issue a bit further. The job actually finishes with the distributed backend, but it takes much, much longer than a serial version of the code or the multiprocessing backend.
It seems the root of the problem is nested parallelization. I am using the distributed context and scattering my data around the outermost loop, but there are further calls to parallel contexts below it. By manually removing some of them from my own code, I managed to cut the main running time by about 30%, but it is still far slower than the multiprocessing joblib backend. There are still calls to parallel code in deeper parts of the stack (scikit-learn code).
I will keep digging.
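If the inner scikit-learn calls are the remaining source of nested parallelism, one common workaround (a sketch with an illustrative estimator and dataset, not code from this thread) is to pin the inner n_jobs to 1 so that only the outer joblib loop fans out:

from joblib import Parallel, delayed
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_train, y_train = make_classification(n_samples=500, random_state=0)

def lengthy_function(seed, X, y):
    # n_jobs=1 prevents the estimator from spawning its own parallel
    # workers inside the already-parallel outer joblib loop.
    model = RandomForestClassifier(n_estimators=50, n_jobs=1, random_state=seed)
    model.fit(X, y)
    return model.score(X, y)

# Only the outer loop is parallelized; inner scikit-learn calls stay serial.
results = Parallel(n_jobs=-1)(
    delayed(lengthy_function)(s, X_train, y_train) for s in range(20)
)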
We’ve been chatting with the Joblib devs about handling the nested parallelism case. It has caused frustration for other applications as well.
When digging into similar situations, I have often found the diagnostic dashboard to be of use: http://distributed.readthedocs.io/en/latest/web.html
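For anyone following along, a minimal sketch of reaching that dashboard from a client session (the ports are the distributed defaults and may differ on your setup; dashboard_link assumes a reasonably recent distributed release):

from distributed import Client

# Connect to the scheduler that the dask-worker instances report to.
client = Client('localhost:8786')

# With bokeh installed, the scheduler serves the diagnostic dashboard
# (port 8787 by default); recent releases expose its URL directly.
print(client.dashboard_link)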