When using a local cluster, shuffle with disk


In feedback for https://github.com/dask/dask/pull/8223, I’ve noticed users trying to do large shuffles on single-machine clusters: https://github.com/dask/dask/pull/8223#issuecomment-961610095, https://github.com/dask/dask/issues/8294#issuecomment-961301624.

When creating any default Client, the default shuffle mode automatically gets set to "tasks": https://github.com/dask/distributed/blob/69814b4aa7459476dcefa133341b566a5ed4e24a/distributed/client.py#L727-L730

However, on a single machine, a disk-based shuffle is likely to be much more efficient and to put far less load on the scheduler.

I think it would be better to keep using the disk-based shuffle if the Client is connected to a LocalCluster (not sure how to tell this). Most users don’t know about the different shuffle modes, and shouldn’t have to.

  • Client detects whether or not it is connected to a LocalCluster and sets the default shuffle config to disk
  • The detection should be based on detecting specifically LocalCluster, not on IP ranges or other means
  • Deprecation should be announced via documentation or a proper warning
  • Benchmark should exist verifying this is faster

Note:

  • This could be resolved at graph construction time when using HLG or HLE since the graph materialization is delayed.
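The detection step in the checklist above could be sketched roughly as follows. This is only an illustration, not the implementation in distributed: `default_shuffle_for` is a hypothetical helper name, and it assumes the check is a plain `isinstance` test against `LocalCluster`, applied to the cluster object the client holds (for example `client.cluster`).

```python
def default_shuffle_for(cluster) -> str:
    """Pick a default shuffle mode for the given cluster object.

    Hypothetical helper: returns "disk" only when `cluster` is a
    distributed.LocalCluster, and "tasks" in every other case.
    """
    try:
        from distributed import LocalCluster
    except ImportError:
        # Without distributed installed there is no LocalCluster to detect.
        return "tasks"
    # Class-based detection only, as the issue requests (no IP-range checks).
    return "disk" if isinstance(cluster, LocalCluster) else "tasks"
```

A class-based check like this cannot recognise a LocalCluster that the client reached through a plain address string, because in that case there is no cluster object to inspect. That limitation is in line with the "not sure how to tell this" caveat in the issue.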

Issue Analytics

  • State: open
  • Created 2 years ago
  • Reactions: 3
  • Comments: 7 (2 by maintainers)

Top GitHub Comments

bsesar commented, Nov 11, 2021 (2 reactions)

I wholeheartedly support @gjoseph92’s suggestion. I use dask on a multi-core, single-node cluster which I start with Client() (i.e., LocalCluster). Had I known that I could use shuffle = 'disk' for merge(), join(), and set_index(), I think it would have spared me numerous cluster deadlocks.

If @gjoseph92’s suggestion is adopted, I would also suggest that when shuffle = 'disk', the temporary_directory (which is used by shuffle and is defined in dask.config) defaults to local_directory (which is used by LocalCluster) to make sure we don’t run out of disk space. Perhaps the documentation could also suggest that users consider installing the python-snappy package to compress shuffled data (and save further disk space). That compression can be turned on with dask.config.set({"dataframe.shuffle-compression": 'Snappy'}).
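The settings described in this comment could be collected along these lines. This is a hedged sketch: `disk_shuffle_settings` is a hypothetical helper, the `shuffle` and `temporary_directory` key names are assumptions based on the dask version discussed in the thread, and the compression key is copied from the comment. Key names may differ in other dask releases.

```python
import tempfile


def disk_shuffle_settings(local_directory=None, compress=False):
    """Build a dask config mapping for a disk-based shuffle.

    Hypothetical helper; key names are assumptions (see the text above).
    """
    settings = {
        "shuffle": "disk",
        # Spill shuffle data next to the workers' local directory,
        # falling back to the system temp directory.
        "temporary_directory": local_directory or tempfile.gettempdir(),
    }
    if compress:
        # Requires the python-snappy package, per the comment above.
        settings["dataframe.shuffle-compression"] = "Snappy"
    return settings
```

If dask is installed, the resulting mapping could then be passed to `dask.config.set(...)` before building the dataframe graph.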

bsesar commented, Dec 1, 2021 (0 reactions)

@fjetter and @gjoseph92, I have created a new issue (#5554) with an MVE that shows that merging is much slower when using shuffle='disk'.


