When using a local cluster, shuffle with disk
In feedback for https://github.com/dask/dask/pull/8223, I've noticed users trying to do large shuffles on single-machine clusters: https://github.com/dask/dask/pull/8223#issuecomment-961610095, https://github.com/dask/dask/issues/8294#issuecomment-961301624.
When creating any default `Client`, the default shuffle mode automatically gets set to `"tasks"`:

https://github.com/dask/distributed/blob/69814b4aa7459476dcefa133341b566a5ed4e24a/distributed/client.py#L727-L730
However, on a single machine, a disk-based shuffle is likely to be a lot more efficient, and it places much lower load on the scheduler.
I think it would be better to keep using the disk-based shuffle when the `Client` is connected to a `LocalCluster` (though I'm not sure how to tell this). Most users don't know about the different shuffle modes, and shouldn't have to.
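In the meantime, users who do know about the modes can opt in manually. A hedged sketch, assuming the `shuffle=` keyword and top-level `shuffle` config key accepted by dask releases contemporary with this issue (later releases renamed these to `shuffle_method` and `dataframe.shuffle.method`, so check your version's documentation):

```python
import dask

# Per-operation: pass the method explicitly (keyword name varies by
# dask version; releases of this era accepted shuffle="disk"):
#     df.set_index("key", shuffle="disk")

# Globally: set the config key for the whole session.
dask.config.set(shuffle="disk")
```

The global form is convenient in notebooks, while the per-call keyword keeps the choice visible next to the operation it affects.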
- `Client` detects whether or not it is connected to a `LocalCluster` and sets the default shuffle config to `disk`
- The detection should be based on detecting specifically `LocalCluster`, not on IP ranges or other means
- Deprecation should be announced via documentation or a proper warning
- A benchmark should exist verifying that this is faster
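The proposed detection could look roughly like the following sketch, using stand-in classes rather than the real `distributed` API (in practice the check would presumably be an `isinstance` test against `distributed.LocalCluster`, and the helper name here is illustrative):

```python
class LocalCluster:
    """Stand-in for distributed.LocalCluster (illustrative only)."""

class SSHCluster:
    """Stand-in for a remote cluster class (illustrative only)."""

def default_shuffle_method(cluster) -> str:
    # Detect specifically LocalCluster, not IP ranges or other
    # heuristics: disk-based shuffles tend to win on one machine,
    # while "tasks" stays the default for distributed clusters.
    if isinstance(cluster, LocalCluster):
        return "disk"
    return "tasks"

print(default_shuffle_method(LocalCluster()))  # -> disk
print(default_shuffle_method(SSHCluster()))    # -> tasks
```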
Note:
- This could be resolved at graph-construction time when using HLG or HLE, since graph materialization is delayed.
Issue Analytics

- State:
- Created: 2 years ago
- Reactions: 3
- Comments: 7 (2 by maintainers)
I wholeheartedly support @gjoseph92's suggestion. I use `dask` on a multi-core, single-node cluster which I start with `Client()` (i.e., a `LocalCluster`). If I had known that I could use `shuffle='disk'` for `merge()`, `join()`, and `set_index()`, I think that would have spared me from numerous cluster deadlocks.

If @gjoseph92's suggestion is adopted, I would also suggest that when `shuffle='disk'`, the `temporary_directory` (which is used by `shuffle` and is defined in `dask.config`) default to the `local_directory` (which is used by `LocalCluster`) to make sure we don't run out of disk space. Perhaps we could even suggest somewhere in the documentation that users consider installing the `python-snappy` package in order to compress shuffled data (and further save on disk space). That compression can be turned on with `dask.config.set({"dataframe.shuffle-compression": 'Snappy'})`.

@fjetter and @gjoseph92, I have created a new issue (#5554) with an MVE that shows how merging is much slower when using `shuffle='disk'`.
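The two knobs mentioned above can be combined in a single config call. A sketch assuming the config key names from the dask version current when this issue was filed (the path is illustrative; `temporary_directory` normalizes to dask's `temporary-directory` key):

```python
import dask

dask.config.set({
    # Spill directory used by shuffle='disk'; pointing it at the same
    # disk as the LocalCluster's local_directory helps avoid running
    # out of space on a small system partition.
    "temporary_directory": "/scratch/dask",
    # Compress shuffled partitions; requires the python-snappy package.
    "dataframe.shuffle-compression": "Snappy",
})
```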