Optimized groupby aggregations when grouping by a sorted index

Similar request to https://github.com/dask/dask/issues/2999

When grouping by the index, and the index has known divisions, most aggregations could be a simple map_partitions. Since each partition already contains all the values for an output group[^1], there’s no need to exchange data between partitions.

However, in these cases we still use apply_concat_apply, _cum_agg, etc., and generate complex graphs that involve a lot of data transfer.
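
To make the claim concrete, here is a minimal plain-Python sketch of why a partition-local aggregation is enough when no group crosses a partition boundary. It is a model of the idea only, not Dask's implementation: the `group_sum` helper and the sample `partitions` data are made up for illustration.

```python
from collections import defaultdict


def group_sum(rows):
    """Sum values per key for an iterable of (key, value) rows."""
    totals = defaultdict(int)
    for key, value in rows:
        totals[key] += value
    return dict(totals)


# Hypothetical data: three partitions of a dataset sorted by key.
# No key appears in more than one partition.
partitions = [
    [(0, 10), (0, 5), (1, 7)],
    [(2, 1), (2, 2), (3, 4)],
    [(4, 8), (5, 9), (5, 1)],
]

# Partition-local aggregation: each partition is reduced on its own,
# and the partial results are simply placed side by side.
local_result = {}
for part in partitions:
    local_result.update(group_sum(part))

# Reference answer: aggregate over all rows at once.
global_result = group_sum(row for part in partitions for row in part)

print(local_result == global_result)
```

Because every key lives in exactly one partition, the per-partition results never need to be merged with each other, which is what would let the aggregation be a single map over partitions.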

In normal pandas use, I’m not sure how common it is to groupby the index versus a column. However, in dask, using the known divisions of the index is highly recommended (see best practices) and something users should try to do, especially with large datasets. In particular, once shuffling performs better (xref https://github.com/dask/distributed/pull/5435), a pattern of doing one set_index up front (or using a partitioned data format like Parquet) and then many fast operations on that should be effective.

I’d want something like this to involve minimal data transfer after the set_index step:

import dask.dataframe as dd
df = dd.read_parquet(...)
# a savvy user recognizes this is worthwhile since there are multiple date ops to do next
df_by_day = df.set_index("date")

daily_users = df_by_day.groupby(["date", "user_id"]).count()
daily_sales = df_by_day.groupby(["date", "sale_amt"]).sum()

daily_summary = daily_users.merge(daily_sales, on="date")
# ^ this should be fast since both groupbys have retained their `divisions`
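
The reason retained divisions should make the merge cheap can be sketched in plain Python: if partition i of both inputs covers the same key range, the join can run partition by partition. This is only an illustration under that assumption; `aligned_join` and the sample values are hypothetical and not part of Dask.

```python
# Hypothetical per-partition results of two aggregations that share the
# same divisions: partition i of each input covers the same key range.
daily_users = [{"2021-01-01": 3, "2021-01-02": 5}, {"2021-01-03": 2}]
daily_sales = [{"2021-01-01": 40.0, "2021-01-02": 12.5}, {"2021-01-03": 9.0}]


def aligned_join(left_parts, right_parts):
    """Inner-join two partitioned key -> value mappings, one partition at a time."""
    if len(left_parts) != len(right_parts):
        raise ValueError("inputs must have the same number of partitions")
    return [
        {key: (left[key], right[key]) for key in left if key in right}
        for left, right in zip(left_parts, right_parts)
    ]


daily_summary = aligned_join(daily_users, daily_sales)
print(daily_summary)
```

Each output partition depends on exactly one partition from each input, so no rows have to move between partitions.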

Someday, it might be nice if users didn’t even have to do the set_index, and we had an optimization that could recognize that multiple groupbys would benefit from a pre-shuffle and insert one automatically. However, that’s a hard optimization to implement (might require HLEs https://github.com/dask/dask/issues/7933) and a ways off. Getting users to understand that they should use set_index more carefully than in pandas, and its importance as a performance tool, seems easier. As we do that, let’s make sure we’re taking as much advantage of it as possible.

[^1] When all the rows in a partition have the same index value, then you do need to combine partitions. For example: in divisions=[0, 1, 2, 2, 4, 5], the partitions containing 1-2, 2-2, and 2-4 would need to be combined, probably using the normal apply_concat_apply logic. However, since we know the divisions, we can be more selective about where we do this and reduce some transfer. With well-balanced partitions, this should be a relatively rare case, and there usually shouldn’t be more than a handful of consecutive partitions with the same value.
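
The selective combining described in this footnote could be sketched as follows. The rule used here is an assumption inferred from the footnote's example, not taken from Dask's source: a value that appears more than once in `divisions` may occur in every partition whose closed range includes it. The function name `partitions_to_combine` is hypothetical.

```python
from collections import Counter


def partitions_to_combine(divisions):
    """Return runs of consecutive partition numbers that may share an index value.

    Assumption: a value repeated in `divisions` may occur in every partition
    whose closed range [divisions[i], divisions[i + 1]] includes it.
    """
    repeated = sorted(v for v, n in Counter(divisions).items() if n > 1)
    runs = []
    for value in repeated:
        run = [
            i
            for i in range(len(divisions) - 1)
            if divisions[i] <= value <= divisions[i + 1]
        ]
        runs.append(run)
    return runs


# The footnote's example: partitions 1-2, 2-2 and 2-4 are numbers 1, 2 and 3.
print(partitions_to_combine([0, 1, 2, 2, 4, 5]))  # [[1, 2, 3]]
print(partitions_to_combine([0, 1, 2, 3]))        # []
```

Only the partitions listed in a run would need the usual combine step; every other partition could be aggregated locally. Runs for different repeated values can overlap, and a real implementation would need to merge them.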

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 11 (6 by maintainers)

Top GitHub Comments

2 reactions
gjoseph92 commented, Dec 1, 2021

> I also wonder just how many users’ workflows are failing because of this precise issue.

Agreed. I think this should be one of the top priority DataFrame issues to work on. I was honestly shocked when I read the code and realized we weren’t utilizing the sorted index.

> the reason why I didn’t even think of mapping groupby using map_partitions at first is that it felt like I’m somehow circumventing Dask and doing something hacky and possibly wrong.

This is usually the right way to feel. It’s very reasonable that you’d expect Dask to be doing the obvious, correct thing here with groupby!

> Here is a related SO question

That was a good question. I replied to it; this is a bug: https://github.com/dask/dask/issues/8437.

2 reactions
jsignell commented, Nov 9, 2021

YES! I am very excited about this idea. In general, there are a lot of opportunities for doing things better in groupby. I think the current implementations are written more for legibility and generalizability and less for performance.

Let me know if you are actively planning on working on this or would like to talk things through.
