Optimize groupby to use dictionary indices

Thanks so much for releasing this project. I’ve been hoping for precisely this kind of engine for a while and look forward to using it.

There’s some relatively low-hanging fruit for a performance bump on dictionary-encoded columns (and probably string columns too, although that’s a little trickier). Currently, groupby on a dictionary column (utf-8 entries with int32 keys) is quite slow because, if I’m tracing the code correctly, each individual row is being decoded from utf-8.
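To make the idea concrete, here is a minimal sketch of counting groups on the integer keys rather than on per-row strings. The `DictColumn` shape and both function names are hypothetical simplifications for illustration; they are not the Arrow JS or Arquero APIs, and null handling is ignored.

```typescript
// Hypothetical, simplified model of a dictionary-encoded column:
// keys[i] is the index into `dictionary` for row i.
interface DictColumn {
  keys: Int32Array;
  dictionary: string[];
}

// Baseline: look up (or decode) the string for every row and group on it.
function countByValue(col: DictColumn): Map<string, number> {
  const counts = new Map<string, number>();
  for (let i = 0; i < col.keys.length; i++) {
    const value = col.dictionary[col.keys[i]]; // per-row string access
    counts.set(value, (counts.get(value) ?? 0) + 1);
  }
  return counts;
}

// Index-based: tally on the integer keys, then touch each string once.
function countByKey(col: DictColumn): Map<string, number> {
  const tally = new Uint32Array(col.dictionary.length);
  for (let i = 0; i < col.keys.length; i++) {
    tally[col.keys[i]]++;
  }
  const counts = new Map<string, number>();
  for (let k = 0; k < tally.length; k++) {
    if (tally[k] > 0) {
      const value = col.dictionary[k];
      // Summing keeps the result correct even if the dictionary
      // happens to contain duplicate entries.
      counts.set(value, (counts.get(value) ?? 0) + tally[k]);
    }
  }
  return counts;
}
```

Both functions return the same counts. The second does string work proportional to the dictionary size instead of the row count, which is where the savings would come from when there are few distinct values.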

Here’s a benchmark counting 2.5 million rows; in this case, it’s about 4.7 seconds to count 1,000 distinct fields on character keys and 0.2 seconds on their integer equivalent.

Similar optimizations should be possible for ‘filter’ and ‘join’ on dictionary columns, although there’s probably more overhead involved.
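For ‘filter’, one possible approach is to evaluate the predicate once per dictionary entry and then select rows by key. This is a sketch under assumptions: `EncodedColumn` and `filterRows` are hypothetical names, the column model is simplified, and nulls are not handled.

```typescript
// Hypothetical, simplified dictionary-encoded column (not the Arrow JS API).
interface EncodedColumn {
  keys: Int32Array;     // one dictionary index per row
  dictionary: string[]; // distinct values
}

// Evaluate the predicate once per dictionary entry, then pick rows by key.
// Returns the indices of the rows that pass.
function filterRows(
  col: EncodedColumn,
  predicate: (value: string) => boolean
): number[] {
  const keep = new Uint8Array(col.dictionary.length);
  for (let k = 0; k < col.dictionary.length; k++) {
    keep[k] = predicate(col.dictionary[k]) ? 1 : 0;
  }
  const rows: number[] = [];
  for (let i = 0; i < col.keys.length; i++) {
    if (keep[col.keys[i]] === 1) rows.push(i);
  }
  return rows;
}
```

This only works when the predicate depends on the column value alone. A predicate that combines several columns would still need per-row evaluation.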

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 11 (7 by maintainers)

Top GitHub Comments

1 reaction
jheer commented, Sep 24, 2020

So I had a change of heart here: Arquero v0.9.0 now includes optimized dictionary unpacking (thanks @bmschmidt!). Given the significant performance improvement for a relatively small amount of extra code, plus the slow pace of Arrow JS development, this seems to have a good cost/benefit ratio. We can always simplify later if Arrow JS rolls out a more performant toArray implementation for dictionary columns in the future.
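As a rough illustration of what optimized dictionary unpacking can look like (a sketch only, not Arquero’s actual implementation), each dictionary entry is decoded once and rows then reuse the cached string. `LazyDictColumn`, `unpackPerRow`, and `unpackCached` are hypothetical names, and `decode` stands in for whatever expensive per-entry read the underlying column performs.

```typescript
// Hypothetical stand-in for a dictionary column whose entries are costly
// to read (for example, UTF-8 bytes that must be decoded on each access).
interface LazyDictColumn {
  keys: Int32Array;                  // one dictionary index per row
  size: number;                      // number of dictionary entries
  decode: (entry: number) => string; // assumed to be expensive
}

// Naive unpacking: one decode call per row.
function unpackPerRow(col: LazyDictColumn): string[] {
  const out = new Array<string>(col.keys.length);
  for (let i = 0; i < col.keys.length; i++) {
    out[i] = col.decode(col.keys[i]);
  }
  return out;
}

// Cached unpacking: decode each entry once, then copy references per row.
function unpackCached(col: LazyDictColumn): string[] {
  const cache = new Array<string>(col.size);
  for (let k = 0; k < col.size; k++) {
    cache[k] = col.decode(k);
  }
  const out = new Array<string>(col.keys.length);
  for (let i = 0; i < col.keys.length; i++) {
    out[i] = cache[col.keys[i]];
  }
  return out;
}
```

The output is an ordinary array of strings, so downstream code that evaluates arbitrary expressions over column values does not need to know the column was dictionary-encoded.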

0 reactions
jheer commented, Sep 20, 2020

I spent more time thinking about accessing Arrow dictionary keys directly as part of the groupby or hash-join / lookup logic. While possible, this is complicated by the fact that Arquero supports not just direct column values, but arbitrary formulas over those column values. In the case of groupby expressions, aggregate calculations are also permitted! Augmenting Arquero to access dictionary keys when available, but not when arbitrary expressions are supplied, incurs (IMHO) a poor cost-to-benefit ratio relative to other possible approaches. As a result, I don’t plan to add direct dictionary key access and am closing out this issue. Instead, I think it would be ideal for Arrow to support more efficient unpacking / caching of extracted string values.


