Unlucky workload balance after Bag.groupby()

[Disclaimer: I’m not 100% sure I’m interpreting these results correctly, so let me know if this is, in fact, expected behavior.]

In certain “unlucky” (but maybe not uncommon) scenarios, Bag.groupby() can lead to very unbalanced workers in a cluster.

Let’s start with a simple test case that works fine. I’ll create a bag of 100 ints, and then “group” them trivially – every element is in a group by itself. Then I’ll use get_worker() to see how the results are balanced across my cluster of 30 workers.

Good grouping example
In [1]: from distributed import Client
   ...: from distributed.worker import get_worker
   ...: from dask_jobqueue import LSFCluster
   ...: from toolz import frequencies
   ...: import dask.bag as db
   ...: import dask.dataframe as ddf
   ...: import pandas as pd
   ...: import numpy as np
   ...:
   ...:

In [2]: cluster = LSFCluster(cores=1, processes=1, memory='15GB', walltime='01:00', log_directory='worker-logs')
   ...: cluster.scale(30)
   ...: client = Client(cluster)
   ...:
   ...:

In [3]: bag = db.from_sequence(np.arange(100))

In [4]: groups = bag.groupby(lambda x: x)

In [5]: workers = groups.map(lambda t: get_worker().address).compute()

In [6]: frequencies(workers)
Out[6]:
{'tcp://10.36.111.32:40190': 3,
 'tcp://10.36.111.36:33107': 3,
 'tcp://10.36.111.39:37165': 3,
 'tcp://10.36.110.32:40587': 4,
 'tcp://10.36.111.33:45234': 3,
 'tcp://10.36.111.42:41962': 4,
 'tcp://10.36.110.40:44968': 4,
 'tcp://10.36.111.36:39365': 3,
 'tcp://10.36.111.23:37433': 3,
 'tcp://10.36.111.32:36918': 3,
 'tcp://10.36.111.27:37787': 3,
 'tcp://10.36.111.21:45552': 4,
 'tcp://10.36.111.36:35714': 5,
 'tcp://10.36.110.32:35934': 3,
 'tcp://10.36.110.38:44253': 4,
 'tcp://10.36.111.27:46136': 4,
 'tcp://10.36.110.35:37716': 4,
 'tcp://10.36.111.32:43984': 4,
 'tcp://10.36.111.18:40827': 2,
 'tcp://10.36.111.21:33749': 4,
 'tcp://10.36.110.32:40599': 3,
 'tcp://10.36.111.16:46645': 4,
 'tcp://10.36.111.18:39199': 3,
 'tcp://10.36.110.40:41659': 3,
 'tcp://10.36.110.38:38503': 3,
 'tcp://10.36.110.38:38701': 3,
 'tcp://10.36.111.21:33351': 3,
 'tcp://10.36.111.39:44899': 3,
 'tcp://10.36.110.40:32966': 3,
 'tcp://10.36.110.35:42767': 2}

In [7]: len(_)
Out[7]: 30

So far, so good. All 30 workers end up with some data. The balance isn’t perfect, but it’s reasonable.

Now let’s make a tiny change. This time, I’ll multiply each int by 64:

Bad grouping example
In [8]: bag = db.from_sequence(64*np.arange(100))

In [9]: groups = bag.groupby(lambda x: x)

In [10]: workers = groups.map(lambda t: get_worker().address).compute()

In [11]: frequencies(workers)
Out[11]:
{'tcp://10.36.111.27:37787': 12,
 'tcp://10.36.110.40:44968': 4,
 'tcp://10.36.110.38:38701': 8,
 'tcp://10.36.111.36:33107': 12,
 'tcp://10.36.110.32:35934': 8,
 'tcp://10.36.111.39:37165': 8,
 'tcp://10.36.111.21:33749': 8,
 'tcp://10.36.111.21:33351': 4,
 'tcp://10.36.111.32:43984': 4,
 'tcp://10.36.111.18:39199': 8,
 'tcp://10.36.111.39:44899': 4,
 'tcp://10.36.111.33:45234': 4,
 'tcp://10.36.110.32:40599': 4,
 'tcp://10.36.111.36:35714': 4,
 'tcp://10.36.111.32:40190': 4,
 'tcp://10.36.110.32:40587': 4}

In [12]: len(_)
Out[12]: 16

As you can see, nearly half of my workers ended up with no data at all, and a few of the others got 3x their “fair share” of data. In my real-world code, I’m manipulating large images, and this unfortunate balancing causes my workers to run out of RAM (or at least, I think that’s what’s happening).
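
For what it’s worth, the skew should also be visible without the cluster, by counting how many groups land in each output partition. The following is just a sketch of what I mean (not code from the session above, and the local shuffle may behave differently from the distributed one):

Per-partition group counts (sketch)
# Count how many (key, group) pairs land in each output partition,
# using the local synchronous scheduler instead of the cluster.
import dask.bag as db
import numpy as np

def groups_per_partition(keys):
    grouped = db.from_sequence(keys).groupby(lambda x: x)
    # Each partition of the grouped bag is a sequence of (key, group) tuples;
    # emit one count per partition.
    return grouped.map_partitions(lambda part: [sum(1 for _ in part)]).compute(scheduler='sync')

print(groups_per_partition(np.arange(100)))       # the "good" case above
print(groups_per_partition(64 * np.arange(100)))  # the "unlucky" case above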

I wonder if this has something to do with the hash() function.

FWIW, I’ve seen a similar issue in pyspark before, too, though in that case it resulted in WAY worse balancing. I vaguely recall them doing something to work around it, but I’m having trouble finding mention of it in their issue tracker.

In the pyspark case, the problem occurred when using 2^N workers to group tuples whose elements were offset by a power of two. It boils down to a surprising (to me) property of the Python hash function: although the tuples hash to different values, those hashes are all identical mod 2^N, so pyspark assigned every one of these tuples to a single worker.

Python hash demo (simulate pyspark worker assignments)
In [1]: import numpy as np
   ...:
   ...: tuples = [(i,) for i in 64*np.arange(5)]
   ...:
   ...: hashes = [hash(t) for t in tuples]
   ...: print('hashes:\n', hashes, '\n')
   ...:
   ...: # Use modulo of hash to assign worker
   ...: WORKER_COUNT = 32
   ...: worker_ids = [h % WORKER_COUNT for h in hashes]
   ...: print('worker assignments:\n', worker_ids)
   ...:
hashes:
 [3430018387555, 3429954387363, 3430146387939, 3430082387747, 3430274388323]

worker assignments:
 [3, 3, 3, 3, 3]

Clearly, the “bad balancing” in the dask example above isn’t THAT bad, so maybe I’m on the wrong track with this hashing theory.
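
Then again, if Bag.groupby() routes each key to an output partition via hash(key) % npartitions (just an assumption on my part; I haven’t checked the source), the arithmetic is at least consistent with the counts above:

Hash-mod-npartitions simulation (assumes hash(key) % npartitions routing)
import numpy as np
from toolz import frequencies

npartitions = 100  # one input partition per element, as in the examples above

good = frequencies(hash(int(k)) % npartitions for k in np.arange(100))
bad = frequencies(hash(int(k)) % npartitions for k in 64 * np.arange(100))

# CPython hashes small non-negative ints to themselves, so:
print(len(good), max(good.values()))  # 100 partitions used, 1 key each
print(len(bad), max(bad.values()))    # 25 partitions used, 4 keys each (64*k mod 100 only hits multiples of 4)

Spread across 30 workers, 25 non-empty partitions holding 4 keys each would produce exactly the kind of multiples-of-4 worker counts shown above. But again, I don’t know whether that’s actually how the shuffle assigns partitions.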

Interestingly, dask’s DataFrame.groupby() does NOT exhibit the bad balance seen with Bag.groupby():

DataFrame Example
In [62]: df = ddf.from_pandas( pd.DataFrame({'x': 64*np.arange(100)}), npartitions=100 ).set_index('x')

In [63]: workers = df.groupby('x')\
    ...:     .apply(lambda _: pd.DataFrame({'worker': [get_worker().address]}),\
    ...:            meta=pd.DataFrame({'worker': ['']}))\
    ...:     .compute()['worker']

In [64]: frequencies(workers)
Out[64]:
{'tcp://10.36.111.27:37787': 4,
 'tcp://10.36.111.18:39199': 3,
 'tcp://10.36.111.36:39365': 5,
 'tcp://10.36.110.35:37716': 3,
 'tcp://10.36.111.16:46645': 3,
 'tcp://10.36.111.39:44899': 3,
 'tcp://10.36.110.32:35934': 3,
 'tcp://10.36.111.42:41962': 3,
 'tcp://10.36.110.38:38701': 3,
 'tcp://10.36.110.38:38503': 3,
 'tcp://10.36.111.21:33351': 3,
 'tcp://10.36.110.32:40587': 3,
 'tcp://10.36.110.32:40599': 3,
 'tcp://10.36.110.35:42767': 3,
 'tcp://10.36.110.38:44253': 3,
 'tcp://10.36.110.40:32966': 3,
 'tcp://10.36.110.40:41659': 3,
 'tcp://10.36.110.40:44968': 3,
 'tcp://10.36.111.18:40827': 3,
 'tcp://10.36.111.21:33749': 3,
 'tcp://10.36.111.21:45552': 3,
 'tcp://10.36.111.23:37433': 3,
 'tcp://10.36.111.27:46136': 4,
 'tcp://10.36.111.32:36918': 4,
 'tcp://10.36.111.32:40190': 4,
 'tcp://10.36.111.32:43984': 4,
 'tcp://10.36.111.33:45234': 4,
 'tcp://10.36.111.36:33107': 4,
 'tcp://10.36.111.36:35714': 4,
 'tcp://10.36.111.39:37165': 3}

In [65]: len(_)
Out[65]: 30

I’m not yet familiar enough with the scheduler internals to know which code to look at to understand how workers are chosen for groupby() results.
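
One difference that might be relevant (my understanding, not something I’ve verified in the scheduler code): dask.dataframe shuffles on hashes computed with pandas’ hashing utilities rather than Python’s built-in hash(), and those hashes don’t preserve the regular spacing of the keys. A quick comparison:

Built-in hash() vs. pandas hashing (illustrative only)
# Whether this is exactly what dask uses internally is an assumption on my part.
import numpy as np
import pandas as pd
from toolz import frequencies

keys = 64 * np.arange(100)

builtin_buckets = frequencies(hash(int(k)) % 30 for k in keys)
pandas_hashes = pd.util.hash_pandas_object(pd.Series(keys), index=False)
pandas_buckets = frequencies(int(h) % 30 for h in pandas_hashes)

print(len(builtin_buckets))  # 15 buckets: 64*k mod 30 only hits even residues
print(len(pandas_buckets))   # should be close to 30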

Issue Analytics

  • State: open
  • Created: 5 years ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

1 reaction
mrocklin commented, Sep 25, 2019

“I had hoped that the tasks would be evenly distributed to my 124 workers, with each one being given a single partition. When I started the job, all tasks were assigned to a single worker!”

This seems different from this issue. Dask doesn’t determine data location by hashing it. A new issue with an MCVE would be welcome.

0 reactions
stuarteberg commented, Sep 26, 2019

Related, I think: Lately I’m seeing an even more exaggerated version of this behavior. In this case, the bad balance occurs before I even use groupby(). All I’m using here is from_sequence() and map_partitions().

Here are the stats of the job I’m running right now:

  • Bag size (total items): 511
  • num_partitions: 257
  • Number of dask workers: 124

I had hoped that the tasks would be evenly distributed to my 124 workers, with each one being given a single partition. When I started the job, all tasks were assigned to a single worker! A minute later, I took the screenshot below; by that point, some of the tasks had been given to one other worker, but the balance is still quite bad: most of my workers are sitting idle, while only two are processing any data.

[screenshot: Dask dashboard showing tasks concentrated on two workers]

Any tips on how to debug this behavior and achieve a better balance would be helpful.

I know that I should switch to DataFrames (or Arrays) to benefit from improved hashing (as mentioned above), but in this case I don’t think hashing is responsible. For instance, I’m specifically avoiding powers-of-two when choosing my partition size and number of workers, so I think ordinary Python hashing ought to be well-behaved in this instance.

FWIW, the specific workload here is fairly simple and not very compute-intensive: each partition is a small bag of integers, and the work is just loading a single PNG file with the corresponding name for each one (0.png, 1.png, 2.png, etc.). At this stage, most of the time is spent on disk I/O.
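
In case it helps, the overall shape of the workload is roughly the following (the file names and the loading helper are illustrative, not my actual code):

Workload sketch (illustrative)
import dask.bag as db
import numpy as np
from PIL import Image  # assumption: Pillow for PNG loading; the real code differs

def load_pngs(partition):
    # Each item in the partition is an integer; load the PNG with that name.
    return [np.array(Image.open(f"{i}.png")) for i in partition]

bag = db.from_sequence(range(511), npartitions=257)  # 511 items, ~257 partitions
images = bag.map_partitions(load_pngs)
results = images.compute()  # mostly disk I/O; ideally spread across all 124 workers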
