Optimize groupby to use dictionary indices

Thanks so much for releasing this project. I’ve been hoping for precisely this kind of engine for a while, and I look forward to using it.

There’s some relatively low-hanging fruit for a performance bump on dictionary-encoded columns. (Probably string columns, too, although that’s a little trickier.) Currently, groupby on a dictionary column (UTF-8 entries with int32 keys) is quite slow because, if I’m tracing the code correctly, each individual row is decoded from UTF-8.
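To make the idea concrete, here is a minimal sketch of grouping on the integer keys instead of the decoded strings. It is not Arquero's or Arrow's code: the `DictionaryColumn` interface and `countByDictionaryKey` function are hypothetical stand-ins, and the sketch assumes every key is a valid, non-null index into the dictionary.

```typescript
// Hypothetical stand-in for a dictionary-encoded column: `dictionary` holds each
// distinct string once, and `keys` holds one int32 index per row.
interface DictionaryColumn {
  dictionary: string[];
  keys: Int32Array;
}

// Count rows per group using only the integer keys, so no string is touched per row.
// Assumes all keys are valid indices (no nulls).
function countByDictionaryKey(column: DictionaryColumn): Map<string, number> {
  const counts = new Uint32Array(column.dictionary.length);
  for (let row = 0; row < column.keys.length; row++) {
    counts[column.keys[row]]++;
  }
  // Translate keys back to strings once per distinct value, not once per row.
  const result = new Map<string, number>();
  for (let key = 0; key < counts.length; key++) {
    if (counts[key] > 0) {
      result.set(column.dictionary[key], counts[key]);
    }
  }
  return result;
}

const exampleColumn: DictionaryColumn = {
  dictionary: ["apple", "banana", "cherry"],
  keys: Int32Array.from([0, 1, 0, 2, 1, 0]),
};
const exampleCounts = countByDictionaryKey(exampleColumn);
```

The per-row loop does only an integer index and an increment, and string handling scales with the number of distinct values rather than the number of rows.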

Here’s a benchmark counting 2.5 million rows; in this case, it’s about 4.7 seconds to count 1,000 distinct fields on character keys and 0.2 seconds on their integer equivalent.

Similar optimizations should be possible for ‘filter’ and ‘join’ on dictionary columns, although there’s probably more overhead involved.
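As a rough illustration of the filter case, the sketch below evaluates a predicate once per dictionary entry and then scans only the integer keys. The function name `filterRowsByDictionary` and its signature are assumptions made for this example, not part of Arquero or Arrow, and it assumes all keys are valid, non-null indices.

```typescript
// Return the indices of rows whose dictionary value satisfies `predicate`.
// The predicate runs once per distinct value; the row scan compares integers only.
function filterRowsByDictionary(
  dictionary: string[],
  keys: Int32Array,
  predicate: (value: string) => boolean
): number[] {
  // Precompute a keep/drop flag for every dictionary entry.
  const keep = new Uint8Array(dictionary.length);
  for (let key = 0; key < dictionary.length; key++) {
    keep[key] = predicate(dictionary[key]) ? 1 : 0;
  }
  // Scan the rows, consulting the precomputed flag for each key.
  const rows: number[] = [];
  for (let row = 0; row < keys.length; row++) {
    if (keep[keys[row]] === 1) {
      rows.push(row);
    }
  }
  return rows;
}

const matchingRows = filterRowsByDictionary(
  ["apple", "banana", "cherry"],
  Int32Array.from([0, 1, 0, 2, 1, 0]),
  (value) => value.startsWith("b")
);
```

This only works directly when the filter depends on the dictionary column alone; an expression that mixes in other columns would need a per-row evaluation.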

Issue Analytics

Created 3 years ago
Comments: 11 (7 by maintainers)

Top GitHub Comments

jheer commented, Sep 24, 2020

So I had a change of heart here: Arquero v0.9.0 now includes optimized dictionary unpacking (thanks @bmschmidt!). Given the significant performance improvement for a relatively small amount of extra code, plus the slow pace of Arrow JS development, this seems to have a good cost/benefit ratio. We can always simplify later if Arrow JS rolls out a more performant toArray implementation for dictionary columns in the future.
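For readers unfamiliar with the technique, the following is a minimal sketch of what unpacking a dictionary column with a per-entry cache can look like. It is not the code that shipped in Arquero v0.9.0: `unpackDictionary` and the `decodeEntry` callback are hypothetical names, with the callback standing in for whatever expensive step turns a dictionary entry into a string. It assumes all keys are valid, non-null indices.

```typescript
// Expand a dictionary-encoded column into one string per row, calling the
// (potentially expensive) decoder at most once per distinct dictionary entry.
function unpackDictionary(
  keys: Int32Array,
  dictionarySize: number,
  decodeEntry: (key: number) => string
): string[] {
  const cache: string[] = new Array<string>(dictionarySize);
  const decoded = new Uint8Array(dictionarySize); // 1 once an entry is cached
  const values: string[] = new Array<string>(keys.length);
  for (let row = 0; row < keys.length; row++) {
    const key = keys[row];
    if (decoded[key] === 0) {
      cache[key] = decodeEntry(key);
      decoded[key] = 1;
    }
    values[row] = cache[key];
  }
  return values;
}

// Usage: count how often the decoder actually runs.
let decodeCalls = 0;
const words = ["red", "green"];
const unpacked = unpackDictionary(
  Int32Array.from([1, 0, 1, 1, 0]),
  words.length,
  (key) => {
    decodeCalls++;
    return words[key];
  }
);
```

In this example the decoder runs twice for five rows, once per distinct entry, which is where the saving over per-row decoding comes from.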

jheer commented, Sep 20, 2020

I spent more time thinking about accessing Arrow dictionary keys directly as part of the groupby or hash-join / lookup logic. While possible, this is complicated by the fact that Arquero supports not just direct column values, but arbitrary formulas over those column values. In the case of groupby expressions, aggregate calculations are also permitted! Augmenting Arquero to access dictionary keys when available but not when arbitrary expressions are supplied incurs (IMHO) a poor cost-to-benefit ratio relative to other possible approaches. As a result I don’t plan to add direct dictionary key access and am closing out this issue. Instead, I think it would be ideal for Arrow to support more efficient unpacking / caching of extracted string values.