Optimized groupby aggregations when grouping by a sorted index
Similar request to https://github.com/dask/dask/issues/2999.
When grouping by the index, and the index has known divisions, most aggregations could be a simple `map_partitions`. Since each partition already contains all the values for an output group[^1], there's no need to exchange data between partitions. However, in these cases we still do `apply_concat_apply`, `_cum_agg`, etc. and generate complex graphs that involve a lot of data transfer.
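To make the idea concrete, here's a minimal sketch (illustrative only, not Dask's actual implementation) comparing the proposed per-partition fast path against the current code path, assuming each index value falls entirely within one partition:

```python
import pandas as pd
import dask.dataframe as dd

# Toy frame with a sorted index; `dd.from_pandas` gives known divisions.
pdf = pd.DataFrame({"x": range(8)}, index=[0, 0, 1, 1, 2, 2, 3, 3])
pdf.index.name = "key"
ddf = dd.from_pandas(pdf, npartitions=4)
assert ddf.known_divisions

# Proposed fast path: one embarrassingly-parallel pass over the partitions,
# valid here because every "key" value lives in a single partition.
fast = ddf.map_partitions(lambda part: part.groupby("key").sum())

# Current behaviour: apply_concat_apply builds a tree reduction that
# concatenates and re-aggregates intermediate results across partitions.
slow = ddf.groupby("key").sum()

assert fast.compute().sort_index().equals(slow.compute().sort_index())
```

The `map_partitions` version is embarrassingly parallel: each output partition depends on exactly one input partition, so no data has to move between workers.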
In normal pandas use, I'm not sure how common it is to groupby the index versus a column. However, in dask, using the known divisions of the index is highly recommended (see best practices) and something users should try to do, especially with large datasets. In particular, once shuffling performs better (xref https://github.com/dask/distributed/pull/5435), a pattern of doing one `set_index` up front (or using a partitioned data format like Parquet) and then many fast operations on that should be effective.
I'd want something like this to involve minimal data transfer after the `set_index` step:
```python
import dask.dataframe as dd

df = dd.read_parquet(...)

# a savvy user recognizes this is worthwhile since there are multiple date ops to do next
df_by_day = df.set_index("date")

daily_users = df_by_day.groupby(["date", "user_id"]).count()
daily_sales = df_by_day.groupby(["date", "sale_amt"]).sum()
daily_summary = daily_users.merge(daily_sales, on="date")
# ^ this should be fast since both groupbys have retained their `divisions`
```
Someday, it might be nice if users didn't even have to do the `set_index`, and we had an optimization that could recognize that multiple groupbys would benefit from a pre-shuffle and insert one automatically. However, that's a hard optimization to implement (might require HLEs https://github.com/dask/dask/issues/7933) and a ways off. Getting users to understand that they should use `set_index` more carefully than in pandas, and its importance as a performance tool, seems easier. As we do that, let's make sure we're taking as much advantage of it as possible.
[^1]: When all the rows in a partition have the same index value, that value can also appear in the neighboring partitions, so you do still need to combine partitions. For example: with `divisions=[0, 1, 2, 2, 4, 5]`, the partitions containing 1-2, 2-2, and 2-4 would need to be combined, probably using the normal `apply_concat_apply` logic. However, since we know the divisions, we can be more selective about where we do this and reduce some transfer. With well-balanced partitions, this should be a relatively rare case, and there usually shouldn't be more than a handful of consecutive partitions with the same value.
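As a concrete illustration of that footnote, here's a small sketch (the `partitions_to_combine` helper is hypothetical, not existing Dask code) of how known divisions could be scanned for repeated boundary values to find the runs of consecutive partitions that would need to be concatenated before a per-partition aggregation:

```python
import itertools

def partitions_to_combine(divisions):
    """Hypothetical helper: yield (first, last) inclusive partition indices
    that must be concatenated because a division value repeats, i.e. a single
    index value may span those partitions."""
    npartitions = len(divisions) - 1
    pos = 0
    for value, run in itertools.groupby(divisions):
        length = len(list(run))
        if length > 1:
            # divisions[pos : pos + length] all equal `value`; the partition
            # just before the run and every partition in the run can hold it.
            yield (max(pos - 1, 0), min(pos + length - 1, npartitions - 1))
        pos += length

# For the footnote's divisions=[0, 1, 2, 2, 4, 5]: partitions 1-3
# (spanning 1-2, 2-2, and 2-4) form one group that needs combining.
print(list(partitions_to_combine([0, 1, 2, 2, 4, 5])))  # [(1, 3)]
```

All other partitions could go straight through the per-partition path; only these runs would need a small, localized concatenation.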
Agreed. I think this should be one of the top priority DataFrame issues to work on. I was honestly shocked when I read the code and realized we weren’t utilizing the sorted index.
This is usually the right way to feel. It’s very reasonable that you’d expect Dask to be doing the obvious, correct thing here with groupby!
That was a good question. I replied to it; this is a bug: https://github.com/dask/dask/issues/8437.
YES! I am very excited about this idea. In general, there are a lot of opportunities for doing things better in groupby. I think the current implementations are written more for legibility and generalizability and less for performance.
Let me know if you are actively planning on working on this or would like to talk things through.