Performance of pandas.algos.groupby_int64


For dask.dataframe shuffle operations (groupby.apply, merge), when running with multiple threads per process, I sometimes find my computations dominated by pandas.algos.groupby_int64. Looking at the source code, it appears to build dynamic pure-Python objects from Cython. I'm curious whether there are ways to accelerate this function, particularly in multi-threaded situations (by releasing the GIL).

One solution that comes to mind would be to do a single pass over labels, pre-compute the length of each members list in results and then pre-allocate these as arrays. This might allow better GIL-releasing behavior.
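A minimal sketch of that two-pass idea, written in plain NumPy for illustration only. The function name `group_indices_preallocated` and the `ngroups` parameter are hypothetical, and the sketch assumes labels are integers in `[0, ngroups)` with negative values meaning "no group". It is not the pandas implementation, and as a Python loop it still holds the GIL.

```python
import numpy as np


def group_indices_preallocated(labels, ngroups):
    """Return, for each group, the row positions that belong to it.

    Hypothetical helper: assumes labels lie in [0, ngroups) and that
    negative labels mark rows to skip.
    """
    labels = np.asarray(labels, dtype=np.int64)

    # Pass 1: count how many rows fall into each group.
    counts = np.bincount(labels[labels >= 0], minlength=ngroups)

    # Pre-allocate one fixed-size integer array per group.
    result = [np.empty(n, dtype=np.int64) for n in counts]

    # Pass 2: write each row position into the next free slot of its group.
    next_slot = np.zeros(ngroups, dtype=np.int64)
    for i, lab in enumerate(labels):
        if lab < 0:
            continue
        result[lab][next_slot[lab]] = i
        next_slot[lab] += 1

    return result


groups = group_indices_preallocated([0, 1, 0, -1, 2, 1, 0], ngroups=3)
print([g.tolist() for g in groups])
```

Because the second pass only reads and writes fixed-size integer buffers, a Cython version of that loop would touch no Python objects and could plausibly run inside a `with nogil:` block, which is the behavior the proposal is after.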

Thoughts?

Issue Analytics

  • State: closed
  • Created 7 years ago
  • Comments: 11 (11 by maintainers)

Top GitHub Comments

2 reactions
jreback commented, Sep 24, 2016
  [d9e51fe7] [3da4a8d7]
+  610.20μs     2.54ms      4.17  groupby.groupby_ngroups_float_100.time_sum
+    2.91ms    11.70ms      4.02  groupby.groupby_ngroups_float_10000.time_count
+   12.49ms    45.73ms      3.66  groupby.groupby_ngroups_float_100.time_unique
+    1.50ms     5.24ms      3.50  groupby.groupby_ngroups_float_100.time_tail
+  484.40ms      1.48s      3.06  groupby.groupby_multi_index.time_groupby_multi_index
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
0 reactions
jreback commented, Sep 24, 2016

I pushed it up: https://github.com/jreback/pandas/tree/groupby

(as I said, I'm still running some perf numbers and checking a couple of edge cases), but give it a go


