When using a local cluster, shuffle with disk
In feedback for https://github.com/dask/dask/pull/8223, I've noticed users trying to do large shuffles on single-machine clusters: https://github.com/dask/dask/pull/8223#issuecomment-961610095, https://github.com/dask/dask/issues/8294#issuecomment-961301624.
When creating any default `Client`, the default shuffle mode automatically gets set to `"tasks"`:

https://github.com/dask/distributed/blob/69814b4aa7459476dcefa133341b566a5ed4e24a/distributed/client.py#L727-L730
However, on a single machine, a disk-based shuffle is likely to be a lot more efficient, and it places much lower load on the scheduler.
I think it would be better to keep using the disk-based shuffle when the `Client` is connected to a `LocalCluster` (though I'm not sure how to tell this). Most users don't know about the different shuffle modes, and shouldn't have to.
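In the meantime, users who do know about the modes can opt in manually. A hedged sketch, assuming the `shuffle=` keyword and top-level `shuffle` config key accepted by dask releases contemporary with this issue (later releases renamed these to `shuffle_method` and `dataframe.shuffle.method`, so check your version's documentation):

```python
import dask

# Per-operation: pass the method explicitly (keyword name varies by
# dask version; releases of this era accepted shuffle="disk"):
#     df.set_index("key", shuffle="disk")

# Globally: set the config key for the whole session.
dask.config.set(shuffle="disk")
```

The global form is convenient in notebooks, while the per-call keyword keeps the choice visible next to the operation it affects.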
- `Client` detects whether or not it is connected to a `LocalCluster` and sets the default shuffle config to `disk`
- The detection should be based on detecting specifically `LocalCluster`, not on IP ranges or other means
- Deprecation should be announced via documentation or a proper warning
- A benchmark should exist verifying that this is faster
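The proposed detection could look roughly like the following sketch, using stand-in classes rather than the real `distributed` API (in practice the check would presumably be an `isinstance` test against `distributed.LocalCluster`, and the helper name here is illustrative):

```python
class LocalCluster:
    """Stand-in for distributed.LocalCluster (illustrative only)."""

class SSHCluster:
    """Stand-in for a remote cluster class (illustrative only)."""

def default_shuffle_method(cluster) -> str:
    # Detect specifically LocalCluster, not IP ranges or other
    # heuristics: disk-based shuffles tend to win on one machine,
    # while "tasks" stays the default for distributed clusters.
    if isinstance(cluster, LocalCluster):
        return "disk"
    return "tasks"

print(default_shuffle_method(LocalCluster()))  # -> disk
print(default_shuffle_method(SSHCluster()))    # -> tasks
```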
Note:
- This could be resolved at graph-construction time when using HLG or HLE, since graph materialization is delayed.
Issue Analytics

- State:
- Created: 2 years ago
- Reactions: 3
- Comments: 7 (2 by maintainers)
I wholeheartedly support @gjoseph92's suggestion. I use `dask` on a multi-core, single-node cluster which I start with `Client()` (i.e., a `LocalCluster`). If I had known that I could use `shuffle='disk'` for `merge()`, `join()`, and `set_index()`, I think that would have spared me from numerous cluster deadlocks.

If @gjoseph92's suggestion is adopted, I would also suggest that when `shuffle='disk'`, the `temporary_directory` (which is used by `shuffle` and is defined in `dask.config`) default to the `local_directory` (which is used by `LocalCluster`) to make sure we don't run out of disk space. Perhaps we could even suggest somewhere in the documentation that users consider installing the `python-snappy` package in order to compress shuffled data (and further save on disk space). That compression can be turned on with `dask.config.set({"dataframe.shuffle-compression": 'Snappy'})`.

@fjetter and @gjoseph92, I have created a new issue (#5554) with an MVE that shows how merging is much slower when using `shuffle='disk'`.
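The two knobs mentioned above can be combined in a single config call. A sketch assuming the config key names from the dask version current when this issue was filed (the path is illustrative; `temporary_directory` normalizes to dask's `temporary-directory` key):

```python
import dask

dask.config.set({
    # Spill directory used by shuffle='disk'; pointing it at the same
    # disk as the LocalCluster's local_directory helps avoid running
    # out of space on a small system partition.
    "temporary_directory": "/scratch/dask",
    # Compress shuffled partitions; requires the python-snappy package.
    "dataframe.shuffle-compression": "Snappy",
})
```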