Unlucky workload balance after Bag.groupby()
See original GitHub issue

[Disclaimer: I’m not 100% sure I’m interpreting these results correctly, so let me know if this is, in fact, expected behavior.]
In certain “unlucky” (but maybe not uncommon) scenarios, Bag.groupby() can lead to very unbalanced workers in a cluster.
Let’s start with a simple test case that works fine. I’ll create a bag of 100 ints, and then “group” them trivially – every element is in a group by itself. Then I’ll use get_worker() to see how the results are balanced across my cluster of 30 workers.
Good grouping example
In [1]: from distributed import Client
...: from distributed.worker import get_worker
...: from dask_jobqueue import LSFCluster
...: from toolz import frequencies
...: import dask.bag as db
...: import numpy as np
...:
In [2]: cluster = LSFCluster(cores=1, processes=1, memory='15GB', walltime='01:00', log_directory='worker-logs')
...: cluster.scale(30)
...: client = Client(cluster)
...:
...:
In [3]: bag = db.from_sequence(np.arange(100))
In [4]: groups = bag.groupby(lambda x: x)
In [5]: workers = groups.map(lambda t: get_worker().address).compute()
In [6]: frequencies(workers)
Out[6]:
{'tcp://10.36.111.32:40190': 3,
'tcp://10.36.111.36:33107': 3,
'tcp://10.36.111.39:37165': 3,
'tcp://10.36.110.32:40587': 4,
'tcp://10.36.111.33:45234': 3,
'tcp://10.36.111.42:41962': 4,
'tcp://10.36.110.40:44968': 4,
'tcp://10.36.111.36:39365': 3,
'tcp://10.36.111.23:37433': 3,
'tcp://10.36.111.32:36918': 3,
'tcp://10.36.111.27:37787': 3,
'tcp://10.36.111.21:45552': 4,
'tcp://10.36.111.36:35714': 5,
'tcp://10.36.110.32:35934': 3,
'tcp://10.36.110.38:44253': 4,
'tcp://10.36.111.27:46136': 4,
'tcp://10.36.110.35:37716': 4,
'tcp://10.36.111.32:43984': 4,
'tcp://10.36.111.18:40827': 2,
'tcp://10.36.111.21:33749': 4,
'tcp://10.36.110.32:40599': 3,
'tcp://10.36.111.16:46645': 4,
'tcp://10.36.111.18:39199': 3,
'tcp://10.36.110.40:41659': 3,
'tcp://10.36.110.38:38503': 3,
'tcp://10.36.110.38:38701': 3,
'tcp://10.36.111.21:33351': 3,
'tcp://10.36.111.39:44899': 3,
'tcp://10.36.110.40:32966': 3,
'tcp://10.36.110.35:42767': 2}
In [7]: len(_)
Out[7]: 30
So far, so good. All 30 workers end up with some data. The balance isn’t perfect, but it’s reasonable.
Now let’s make a tiny change. This time, I’ll multiply each int by 64:
Bad grouping example
In [8]: bag = db.from_sequence(64*np.arange(100))
In [9]: groups = bag.groupby(lambda x: x)
In [10]: workers = groups.map(lambda t: get_worker().address).compute()
In [11]: frequencies(workers)
Out[11]:
{'tcp://10.36.111.27:37787': 12,
'tcp://10.36.110.40:44968': 4,
'tcp://10.36.110.38:38701': 8,
'tcp://10.36.111.36:33107': 12,
'tcp://10.36.110.32:35934': 8,
'tcp://10.36.111.39:37165': 8,
'tcp://10.36.111.21:33749': 8,
'tcp://10.36.111.21:33351': 4,
'tcp://10.36.111.32:43984': 4,
'tcp://10.36.111.18:39199': 8,
'tcp://10.36.111.39:44899': 4,
'tcp://10.36.111.33:45234': 4,
'tcp://10.36.110.32:40599': 4,
'tcp://10.36.111.36:35714': 4,
'tcp://10.36.111.32:40190': 4,
'tcp://10.36.110.32:40587': 4}
In [12]: len(_)
Out[12]: 16
As you can see, nearly half of my workers ended up with no data at all, while a few of the others have 3x their “fair share” of data. In my real-world code, I’m manipulating large images, and this unfortunate balancing causes my workers to run out of RAM (or at least, I think that’s what’s happening).
I wonder if this has something to do with the hash() function.
FWIW, I’ve seen a similar issue in pyspark before, too, though in that case it resulted in WAY worse balancing. I vaguely recall them doing something to work around it, but I’m having trouble finding mention of it in their issue tracker.
In the pyspark case, the problem occurred when using 2^N workers to group tuples whose elements were offset by a power of two. It boils down to a surprising (to me) property of the Python hash function: although the data in these tuples hash to different values, they’re all the same after mod-by-2^N, so pyspark would hash all these tuples to a single worker.
Python hash demo (simulate pyspark worker assignments)
In [1]: import numpy as np
...:
...: tuples = [(i,) for i in 64*np.arange(5)]
...: hashes = [hash(t) for t in tuples]
...: print('hashes:\n', hashes, '\n')
...:
...: # Use modulo of hash to assign worker
...: WORKER_COUNT = 32
...: worker_ids = [h % WORKER_COUNT for h in hashes]
...: print('worker assignments:\n', worker_ids)
...:
hashes:
[3430018387555, 3429954387363, 3430146387939, 3430082387747, 3430274388323]
worker assignments:
[3, 3, 3, 3, 3]
Clearly, the “bad balancing” in the dask example above isn’t THAT bad, so maybe I’m on the wrong track with this hashing theory.
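That said, if Bag.groupby() assigns output partitions by something like hash(key) % npartitions (an assumption on my part – I haven’t confirmed this in the dask source), the pattern above is easy to reproduce offline. Python hashes small ints to themselves, so keys that are all multiples of 64 can only land in partition indices that are multiples of gcd(64, npartitions):

```python
# Simulate a hypothetical partition-assignment scheme of
# hash(key) % npartitions.  For small ints, hash(n) == n.
npartitions = 100  # hypothetical partition count

plain = [hash(i) % npartitions for i in range(100)]
strided = [hash(64 * i) % npartitions for i in range(100)]

print(len(set(plain)))    # 100 -- every partition receives a key
print(len(set(strided)))  # 25  -- only a quarter of the partitions are used
```

Since gcd(64, 100) = 4, the strided keys can only reach every fourth partition; fewer occupied partitions would then translate directly into fewer busy workers.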
Interestingly, dask.DataFrame.groupby() does NOT exhibit the bad balance that Bag.groupby() exhibits:
DataFrame Example
In [61]: import pandas as pd
...: import dask.dataframe as ddf
In [62]: df = ddf.from_pandas( pd.DataFrame({'x': 64*np.arange(100)}), npartitions=100 ).set_index('x')
In [63]: workers = df.groupby('x')\
.apply(lambda _: pd.DataFrame({'worker': [get_worker().address]}),\
meta=pd.DataFrame({'worker': ['']}))\
.compute()['worker']
In [64]: frequencies(workers)
Out[64]:
{'tcp://10.36.111.27:37787': 4,
'tcp://10.36.111.18:39199': 3,
'tcp://10.36.111.36:39365': 5,
'tcp://10.36.110.35:37716': 3,
'tcp://10.36.111.16:46645': 3,
'tcp://10.36.111.39:44899': 3,
'tcp://10.36.110.32:35934': 3,
'tcp://10.36.111.42:41962': 3,
'tcp://10.36.110.38:38701': 3,
'tcp://10.36.110.38:38503': 3,
'tcp://10.36.111.21:33351': 3,
'tcp://10.36.110.32:40587': 3,
'tcp://10.36.110.32:40599': 3,
'tcp://10.36.110.35:42767': 3,
'tcp://10.36.110.38:44253': 3,
'tcp://10.36.110.40:32966': 3,
'tcp://10.36.110.40:41659': 3,
'tcp://10.36.110.40:44968': 3,
'tcp://10.36.111.18:40827': 3,
'tcp://10.36.111.21:33749': 3,
'tcp://10.36.111.21:45552': 3,
'tcp://10.36.111.23:37433': 3,
'tcp://10.36.111.27:46136': 4,
'tcp://10.36.111.32:36918': 4,
'tcp://10.36.111.32:40190': 4,
'tcp://10.36.111.32:43984': 4,
'tcp://10.36.111.33:45234': 4,
'tcp://10.36.111.36:33107': 4,
'tcp://10.36.111.36:35714': 4,
'tcp://10.36.111.39:37165': 3}
In [65]: len(_)
Out[65]: 30
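My guess (unverified) is that dask.dataframe fares better because it partitions with pandas’ hashing utilities rather than the builtin hash. A quick check with pd.util.hash_pandas_object suggests that multiples of 64 get mixed into well-spread hash values, whereas the builtin hash leaves them all in one bucket:

```python
import numpy as np
import pandas as pd

values = pd.Series(64 * np.arange(100))

# Builtin hash: for ints, hash(n) == n, so multiples of 64 all
# collapse into bucket 0 when taken mod 32.
builtin_buckets = set(hash(v) % 32 for v in values)

# pandas' hashing utility mixes the bits first, so the same
# values spread across nearly all 32 buckets.
mixed = pd.util.hash_pandas_object(values, index=False)
mixed_buckets = set(mixed % 32)

print(len(builtin_buckets))  # 1
print(len(mixed_buckets))    # close to 32
```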
I’m not yet familiar enough with the scheduler internals to know which code to look at to understand how workers are chosen for groupby() results.
Issue Analytics
- Created 5 years ago
- Comments: 5 (5 by maintainers)
Top GitHub Comments
This seems different from this issue. Dask doesn’t determine data location by hashing it. A new issue with an MCVE would be welcome.
Related, I think: Lately I’m seeing an even more exaggerated version of this behavior. In this case, the bad balance occurs before I even use groupby(). All I’m using here is from_sequence() and map_partitions(). Here are the stats of the job I’m running right now:
- Bag size (total items): 511
- num_partitions: 257
- number of dask workers: 124
I had hoped that the tasks would be evenly distributed to my 124 workers, with each one being given a single partition. When I started the job, all tasks were assigned to a single worker! A minute later, I took the following screenshots. By that point, some of the tasks had been given to one other worker, but the balance was still quite bad. Most of my workers were sitting idle, while only two were processing any data.
Any tips on how to debug this behavior and achieve a better balance would be helpful.
I know that I should switch to DataFrames (or Arrays) to benefit from improved hashing (as mentioned above), but in this case I don’t think hashing is responsible. For instance, I’m specifically avoiding powers-of-two when choosing my partition size and number of workers, so I think ordinary Python hashing ought to be well-behaved in this instance.
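One caveat about the power-of-two intuition, for what it’s worth: with the builtin hash of ints, what matters is the gcd between the key stride and the bucket count, not whether the bucket count is a power of two. A purely illustrative example (these aren’t necessarily the keys dask is hashing here): keys with stride 64 still collide badly across 124 buckets, because gcd(64, 124) = 4:

```python
from math import gcd

workers = 124  # not a power of two
stride = 64

# Residues of stride*i mod workers form a subgroup of size
# workers // gcd(stride, workers).
buckets = set((stride * i) % workers for i in range(1000))
print(len(buckets))  # 31, i.e. 124 // gcd(64, 124)
```

So avoiding powers of two helps only if the new count is actually coprime with whatever stride the keys happen to have.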
FWIW, the specific workload here is fairly simple, and not very compute-intensive: essentially, each partition consists of a bag of integers, and the work is loading a single PNG file with the corresponding name (0.png, 1.png, 2.png, etc.). At this stage, most of the time is spent on disk I/O.