Dask backend auto-scattering overloads scheduler memory
Cross-posting from https://github.com/dask/dask-ml/issues/789, but I think the better home for this might be here.
I seem to be running into problems using the Dask distributed backend for joblib with scikit-learn classes. This notebook has the full reproducible example using a FargateCluster (the issue does not happen with a LocalCluster): https://nbviewer.jupyter.org/gist/rikturr/66427bd13e692726044b4903a790f013
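For reference, the snippets below assume roughly the following setup. This is a hedged sketch, not the notebook code: the estimator, parameter grid, data shape, and cluster arguments are placeholders, chosen so the training array is roughly 50MB.

# Hedged sketch of the setup assumed by the snippets below; names and sizes are illustrative.
import numpy as np
import joblib
from dask.distributed import Client
from dask_cloudprovider.aws import FargateCluster  # import path for recent dask-cloudprovider releases
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

cluster = FargateCluster(n_workers=4)  # placeholder worker count
client = Client(cluster)

# roughly 50MB of dense float64 training data (65,000 x 100 x 8 bytes)
data = np.random.random((65_000, 100))
target = np.random.randint(0, 2, size=65_000)

search = RandomizedSearchCV(
    RandomForestClassifier(),
    {'n_estimators': [50, 100, 200], 'max_depth': [3, 5, None]},
    n_iter=10,
)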
This part fails with a ~50MB data size:
with joblib.parallel_backend('dask'):
    search.fit(data, target)
It also fails if I scatter the objects beforehand:
client.scatter([data, target])
with joblib.parallel_backend('dask'):
    search.fit(data, target)
It works properly if I manually scatter within parallel_backend like so:
with joblib.parallel_backend('dask', scatter=[data, target]):
    search.fit(data, target)
This leads me to believe that the auto-scattering is causing a large amount of memory to be passed through the scheduler at once.
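One way to check that hypothesis directly (a rough sketch, not from the notebook; it assumes psutil is installed on the scheduler and reuses the client, search, data, and target from above) is to read the scheduler process's resident memory around the fit, in addition to watching the dashboard:

# Hedged sketch: compare the scheduler's resident memory before and after the fit.
# Requires psutil on the scheduler node; scheduler_rss_mb is an illustrative helper.
import psutil

def scheduler_rss_mb():
    return psutil.Process().memory_info().rss / 1e6

print('scheduler RSS before fit (MB):', client.run_on_scheduler(scheduler_rss_mb))
with joblib.parallel_backend('dask'):
    search.fit(data, target)
print('scheduler RSS after fit (MB):', client.run_on_scheduler(scheduler_rss_mb))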
Top GitHub Comments
I released joblib 1.0.1 with this fix. Closing.
Just tested - master is working great! Tried it out with a 5GB object: the scheduler gets up to 5GB for a few seconds, then drops back down once the workers take over. I was previously on 1.0, so something that happened in master since then seems to have fixed it 😃
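For anyone who lands here with the same symptom, confirming that the installed joblib carries the fix is a quick check. A minimal sketch, assuming the fix shipped in 1.0.1 as stated above:

# Verify the running environment has joblib 1.0.1 or later, per the maintainer comment above.
import joblib
print(joblib.__version__)  # expect '1.0.1' or newer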