Performance of pandas.algos.groupby_int64
For dask.dataframe shuffle operations (groupby.apply, merge), when running with multiple threads per process, I sometimes find my computations dominated by `pandas.algos.groupby_int64`. Looking at the source code for this, it looks like it's using dynamic pure Python objects from Cython. I'm curious if there are ways to accelerate this function, particularly in multi-threaded situations (releasing the GIL).
One solution that comes to mind would be to do a single pass over `labels`, pre-compute the length of each `members` list in `results`, and then pre-allocate these as arrays. This might allow better GIL-releasing behavior.
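To make the idea concrete, here is a rough sketch of the two-pass approach in plain Python. It is not the pandas implementation: the function name `group_positions`, the `ngroups` argument, and the convention that negative labels mark missing values and are skipped are all assumptions for illustration. The stdlib `array` module stands in for the typed buffers a Cython version would use.

```python
from array import array


def group_positions(labels, ngroups):
    """Hypothetical helper: group row positions by integer label in two passes.

    Returns (members, offsets); the positions belonging to group g are
    members[offsets[g]:offsets[g + 1]].
    """
    # Pass 1: count how many rows fall into each group.
    counts = array("q", [0]) * ngroups
    for lab in labels:
        if lab >= 0:  # assumption: negative labels mean "missing"
            counts[lab] += 1

    # Prefix sums give the start of each group's slot in one flat buffer.
    offsets = array("q", [0]) * (ngroups + 1)
    for g in range(ngroups):
        offsets[g + 1] = offsets[g] + counts[g]

    # Pre-allocate the output once, sized by the total count.
    members = array("q", [0]) * offsets[ngroups]

    # Pass 2: write each row position into the next free slot of its group.
    cursor = array("q", offsets[:ngroups])
    for i, lab in enumerate(labels):
        if lab >= 0:
            members[cursor[lab]] = i
            cursor[lab] += 1

    return members, offsets


labels = [1, 0, 1, -1, 2, 0]
members, offsets = group_positions(labels, ngroups=3)
for g in range(3):
    print(g, list(members[offsets[g]:offsets[g + 1]]))
```

Both loops touch only fixed-size integer buffers, so in a Cython port they would be candidates for a `nogil` block; building per-group Python lists or a dict, if still needed, could happen afterwards in a short step that holds the GIL.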
Thoughts?
Issue Analytics
- State:
- Created 7 years ago
- Comments:11 (11 by maintainers)
I pushed it up: https://github.com/jreback/pandas/tree/groupby
(as I said, I'm still running some perf numbers and checking a couple of edge cases), but give it a go.