Performance of pandas.algos.groupby_int64

For dask.dataframe shuffle operations (groupby.apply, merge) running with multiple threads per process, I sometimes find my computations dominated by pandas.algos.groupby_int64. Looking at the source code, it appears to build dynamic pure-Python objects from Cython. I’m curious whether there are ways to accelerate this function, particularly in multi-threaded situations (by releasing the GIL).

One solution that comes to mind would be to do a single pass over labels, pre-compute the length of each label's members list in the result, and then pre-allocate these lists as arrays. This might allow better GIL-releasing behavior.
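To make the idea concrete, here is a rough NumPy sketch, not the pandas implementation. The first function paraphrases my reading of the current behavior (a dict of Python lists grown by append). The second counts members per label, pre-allocates one flat buffer, and fills it in a second pass. Both function names are made up for illustration, and the fill loop is written in plain Python only to show the logic; the assumption is that in Cython a loop like this touches only typed arrays and so could run without the GIL.

```python
import numpy as np


def groupby_dict_of_lists(index, labels):
    """Baseline sketch: one Python list per label, grown by append."""
    result = {}
    for idx, key in zip(index.tolist(), labels.tolist()):
        result.setdefault(key, []).append(idx)
    return result


def groupby_preallocated(index, labels):
    """Two-pass sketch: count members per label, pre-allocate, then fill."""
    # Pass 1: map each label to a dense group code and count group sizes.
    uniques, codes = np.unique(labels, return_inverse=True)
    counts = np.bincount(codes, minlength=len(uniques))

    # offsets[g] is where group g starts inside the flat output buffer.
    offsets = np.zeros(len(uniques) + 1, dtype=np.int64)
    offsets[1:] = np.cumsum(counts)

    # Pass 2: write each index value into its group's next free slot.
    out = np.empty(len(index), dtype=index.dtype)
    cursor = offsets[:-1].copy()
    for i in range(len(index)):
        g = codes[i]
        out[cursor[g]] = index[i]
        cursor[g] += 1

    # Each group is a slice (view) of the single pre-allocated buffer.
    return {
        key: out[offsets[g]:offsets[g + 1]]
        for g, key in enumerate(uniques.tolist())
    }
```

Both functions return the same members per label, in the original order within each group; the second returns array slices rather than lists, which is the part that would change the result type callers see.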

Thoughts?

Issue Analytics

Created 7 years ago
Comments: 11 (11 by maintainers)

Top GitHub Comments

2 reactions

jreback commented, Sep 24, 2016

     before      after     ratio
 [d9e51fe7] [3da4a8d7]
+  610.20μs     2.54ms      4.17  groupby.groupby_ngroups_float_100.time_sum
+    2.91ms    11.70ms      4.02  groupby.groupby_ngroups_float_10000.time_count
+   12.49ms    45.73ms      3.66  groupby.groupby_ngroups_float_100.time_unique
+    1.50ms     5.24ms      3.50  groupby.groupby_ngroups_float_100.time_tail
+  484.40ms      1.48s      3.06  groupby.groupby_multi_index.time_groupby_multi_index
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

0 reactions

jreback commented, Sep 24, 2016

I pushed it up: https://github.com/jreback/pandas/tree/groupby

(as I said, I'm still running some perf numbers and checking a couple of edge cases), but give it a go