ufunc.at performance >10x too slow
I have created a Matlab-like accumarray function which tries to squeeze as much performance from numpy as possible for a specific list of functions: sum, any, all, max, min, mean, etc.
The function is available as a gist here.
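The accumarray idea can be illustrated with a minimal sketch for the sum case (this is a hypothetical simplified signature; the gist's actual function handles many more aggregations and options):

```python
import numpy as np

def accum_sum(idx, vals, size=None):
    # Group-wise sum: out[k] = sum of vals[i] where idx[i] == k.
    if size is None:
        size = int(idx.max()) + 1
    out = np.zeros(size, dtype=vals.dtype)
    np.add.at(out, idx, vals)  # unbuffered scatter-add handles repeated indices
    return out

out = accum_sum(np.array([0, 1, 0, 2]), np.array([1.0, 2.0, 3.0, 4.0]))
# → array([4., 2., 4.])
```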
There is another accumarray implementation available on github here, which was originally created by @ml31415, but over the last couple of days has had many of my suggestions incorporated (I have also updated my own gist in line with my recent suggestions).
The original purpose of this other implementation was to use scipy.weave to get massively improved performance over what was assumed to be the best-possible raw-numpy version. However, it appears that most of the functions can be fairly heavily optimized without scipy.weave. [It's not really important here, but for reference, the remaining difficult-to-optimize functions are min, max, and prod.]
The main point, however, is that it should be simple to optimize by just using ufunc.at for the relevant ufunc (as this is exactly what ufunc.at is intended for), yet ufunc.at gets miserable performance: about 15x slower than scipy.weave and 10-25x slower than a bunch of carefully written alternative numpy algorithms (where such optimized algorithms exist). Surely ufunc.at could be improved?
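For concreteness, this is the kind of grouped reduction ufunc.at is meant for, sketched here with a grouped max (ufunc.at operates unbuffered, so repeated indices are each applied in turn):

```python
import numpy as np

idx = np.array([0, 1, 0, 2, 1])
vals = np.array([3.0, 5.0, 7.0, 2.0, 1.0])

# Grouped max: out[k] = max of vals[i] where idx[i] == k.
# Start from -inf so any real value replaces the initial entry.
out = np.full(3, -np.inf)
np.maximum.at(out, idx, vals)
# out → array([7., 5., 2.])
```

This one-liner is exactly the "obvious" implementation whose performance the issue is complaining about.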
Also, and on a separate note, would there be any interest in including accumarray itself in numpy?
Here are some benchmarking stats produced by my function (testing and benchmarking code is included at the bottom of the gist). Note that the baseline times are obtained by sorting, splitting, and looping, using the named numpy function for each group; whereas the optimised functions do some kind of hand-crafted vectorised operation in most cases, except max, min, and prod, which use ufunc.at. Note also that the actual observed speedup depends on a variety of properties of the input. Here we are using 100,000 indices uniformly picked from the interval [0, 1000). Specifically, about 25% of the values are 0 (for use with the bool tests); the remainder are uniformly distributed on [-50, 25).
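A benchmark input roughly matching that description can be generated as follows (the gist's exact generator may differ; the seed and generator choice here are my own):

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_groups = 100_000, 1_000

idx = rng.integers(0, n_groups, size=n)    # indices uniform on [0, 1000)
vals = rng.uniform(-50, 25, size=n)        # values uniform on [-50, 25)
vals[rng.random(n) < 0.25] = 0             # ~25% zeros, for the bool tests
```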
std   baseline 190.4 ms  optimised  7.3 ms  ... 26.2x faster
all   baseline  77.2 ms  optimised  8.6 ms  ...  9.0x faster
min   baseline  65.3 ms  optimised 50.0 ms  ...  1.3x faster
max   baseline  64.4 ms  optimised 45.8 ms  ...  1.4x faster
sum   baseline  64.6 ms  optimised  2.4 ms  ... 27.3x faster
var   baseline 173.4 ms  optimised  7.7 ms  ... 22.4x faster
prod  baseline  63.5 ms  optimised 50.2 ms  ...  1.3x faster
any   baseline  75.5 ms  optimised  7.0 ms  ... 10.8x faster
mean  baseline 100.2 ms  optimised  3.7 ms  ... 26.9x faster
Issue Analytics
- Created 8 years ago
- Comments:16 (10 by maintainers)
Top GitHub Comments
Another hint: if you use numpy.add.at, a much faster alternative is numpy.bincount with its optional weights argument.

Just for reference, fastfunc (a small pybind11 project of mine) speeds things up by a factor of about 40.