Optimize groupby to use dictionary indices
Thanks so much for releasing this project. I’ve been hoping for precisely this kind of engine for a while, and I look forward to using it.
There’s some relatively low-hanging fruit for a performance bump on dictionary-encoded columns. (Probably string columns, too, although it’s a little trickier.) Currently, groupby on a dictionary column (utf-8 entries with int32 keys) is quite slow because, if I’m tracing the code correctly, each individual row is being decoded from utf-8.
Similar optimizations should be possible for ‘filter’ and ‘join’ on dictionary columns, although there’s probably more overhead involved.
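To make the idea concrete, here is a minimal sketch of grouping on the int32 keys and touching the dictionary values only once per group. The column object and the groupByDictionaryKeys helper are hypothetical stand-ins written for illustration, not the Arrow JS or Arquero API, and the sketch assumes no null keys and no duplicate dictionary entries.

```javascript
// Hypothetical stand-in for a dictionary-encoded column: one int32 key per row
// plus a small table of distinct values. This is not the Arrow JS or Arquero API.
const column = {
  keys: Int32Array.from([0, 1, 0, 2, 1, 0]),
  dictionary: ['apple', 'banana', 'cherry'],
};

// Group rows by dictionary key without reading a string value per row.
// Assumes every key is a valid index into the dictionary (no nulls).
function groupByDictionaryKeys({ keys, dictionary }) {
  // One bucket of row indices per dictionary entry.
  const buckets = dictionary.map(() => []);
  for (let row = 0; row < keys.length; ++row) {
    buckets[keys[row]].push(row);
  }

  // Look up each distinct value once, skipping entries that no row uses.
  const groups = new Map();
  buckets.forEach((rows, key) => {
    if (rows.length > 0) groups.set(dictionary[key], rows);
  });
  return groups;
}

const groups = groupByDictionaryKeys(column);
```

The per-row work is a single integer array lookup, so the cost of reading string values scales with the number of distinct entries rather than the number of rows.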
Issue Analytics
- State:
- Created 3 years ago
- Comments:11 (7 by maintainers)
Top GitHub Comments
So I had a change of heart here: Arquero v0.9.0 now includes optimized dictionary unpacking (thanks @bmschmidt!). Given the significant performance improvement for a relatively small amount of extra code, plus the slow pace of Arrow JS development, this seems to have a good cost/benefit ratio. We can always simplify later if Arrow JS rolls out a more performant toArray implementation for dictionary columns in the future.
I spent more time thinking about accessing Arrow dictionary keys directly as part of the groupby or hash-join/lookup logic. While possible, this is complicated by the fact that Arquero supports not just direct column values, but arbitrary formulas over those column values. In the case of groupby expressions, aggregate calculations are also permitted! Augmenting Arquero to access dictionary keys when available but not when arbitrary expressions are supplied incurs (IMHO) a poor cost-to-benefit ratio relative to other possible approaches. As a result I don’t plan to add direct dictionary key access and am closing out this issue. Instead, I think it would be ideal for Arrow to support more efficient unpacking / caching of extracted string values.
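As a rough illustration of what "unpacking with caching" can mean, the sketch below decodes each distinct dictionary entry from UTF-8 once and then fills the output column by key lookup. It is not Arquero's actual v0.9.0 implementation; the byte-plus-offsets layout and the makeDictionary and unpackDictionaryColumn helpers are assumptions made up for this example.

```javascript
// Hypothetical layout: dictionary strings stored as UTF-8 bytes plus offsets,
// loosely modeled on variable-length string storage. Names here are illustrative.
const encoder = new TextEncoder();
const decoder = new TextDecoder('utf-8');

// Build the packed byte buffer and offsets for a list of distinct strings.
function makeDictionary(strings) {
  const parts = strings.map((s) => encoder.encode(s));
  const offsets = new Int32Array(parts.length + 1);
  parts.forEach((p, i) => { offsets[i + 1] = offsets[i] + p.length; });
  const bytes = new Uint8Array(offsets[parts.length]);
  parts.forEach((p, i) => bytes.set(p, offsets[i]));
  return { bytes, offsets };
}

// Materialize a plain array of strings from keys plus a packed dictionary.
function unpackDictionaryColumn(keys, { bytes, offsets }) {
  const size = offsets.length - 1;

  // Decode each distinct entry exactly once.
  const decoded = new Array(size);
  for (let i = 0; i < size; ++i) {
    decoded[i] = decoder.decode(bytes.subarray(offsets[i], offsets[i + 1]));
  }

  // Per row, reuse the already-decoded string instead of decoding again.
  const out = new Array(keys.length);
  for (let row = 0; row < keys.length; ++row) {
    out[row] = decoded[keys[row]];
  }
  return out;
}

const dict = makeDictionary(['red', 'green', 'blü']);
const values = unpackDictionaryColumn(Int32Array.from([2, 0, 0, 1, 2]), dict);
```

Because the result is an ordinary array of values, arbitrary formulas and aggregate expressions can run over it unchanged, which is the trade-off the comment above favors over direct key access.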