ufunc.at performance >10x too slow
I have created a Matlab-like accumarray function which tries to squeeze as much performance from numpy as possible for a specific list of functions: sum, any, all, max, min, mean, etc.
The function is available as a gist here.
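The accumarray idea can be illustrated with a minimal sketch for the sum case (this is a hypothetical simplified signature; the gist's actual function handles many more aggregations and options):

```python
import numpy as np

def accum_sum(idx, vals, size=None):
    # Group-wise sum: out[k] = sum of vals[i] where idx[i] == k.
    if size is None:
        size = int(idx.max()) + 1
    out = np.zeros(size, dtype=vals.dtype)
    np.add.at(out, idx, vals)  # unbuffered scatter-add handles repeated indices
    return out

out = accum_sum(np.array([0, 1, 0, 2]), np.array([1.0, 2.0, 3.0, 4.0]))
# → array([4., 2., 4.])
```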
There is another accumarray implementation available on github here, which was originally created by @ml31415, but over the last couple of days has had many of my suggestions incorporated (I have also updated my own gist in line with my recent suggestions).
The original purpose of this other implementation was to use scipy.weave to get massively improved performance over what was assumed to be the best-possible raw-numpy version. However, it appears that most of the functions can be fairly heavily optimized without scipy.weave. [It's not really important here, but for reference, the remaining difficult-to-optimize functions are min, max, and prod.]
The main point, however, is that it should be simple to optimize by just using ufunc.at for the relevant ufunc (as this is exactly what ufunc.at is intended for), yet ufunc.at gets miserable performance: about 15x slower than scipy.weave and 10-25x slower than a bunch of carefully written alternative numpy algorithms (where such optimized algorithms exist). Surely ufunc.at could be improved?
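For concreteness, this is the kind of grouped reduction ufunc.at is meant for, sketched here with a grouped max (ufunc.at operates unbuffered, so repeated indices are each applied in turn):

```python
import numpy as np

idx = np.array([0, 1, 0, 2, 1])
vals = np.array([3.0, 5.0, 7.0, 2.0, 1.0])

# Grouped max: out[k] = max of vals[i] where idx[i] == k.
# Start from -inf so any real value replaces the initial entry.
out = np.full(3, -np.inf)
np.maximum.at(out, idx, vals)
# out → array([7., 5., 2.])
```

This one-liner is exactly the "obvious" implementation whose performance the issue is complaining about.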
Also, and on a separate note, would there be any interest in including accumarray itself in numpy?
Here are some benchmarking stats produced by my function (testing and benchmarking code is included at the bottom of the gist). Note that the baseline times are obtained by sorting, splitting, and looping, using the named numpy function for each group; whereas the optimised functions do some kind of hand-crafted vectorised operation in most cases, except max, min, and prod, which use ufunc.at. Note also that the actual observed speedup depends on a variety of properties of the input. Here we are using 100,000 indices uniformly picked from the interval [0, 1000). Specifically, about 25% of the values are 0 (for use with the bool tests); the remainder are uniformly distributed on [-50, 25).
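A benchmark input roughly matching that description can be generated as follows (the gist's exact generator may differ; the seed and generator choice here are my own):

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_groups = 100_000, 1_000

idx = rng.integers(0, n_groups, size=n)    # indices uniform on [0, 1000)
vals = rng.uniform(-50, 25, size=n)        # values uniform on [-50, 25)
vals[rng.random(n) < 0.25] = 0             # ~25% zeros, for the bool tests
```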
std   baseline 190.4 ms  optimised  7.3 ms  ... 26.2x faster
all   baseline  77.2 ms  optimised  8.6 ms  ...  9.0x faster
min   baseline  65.3 ms  optimised 50.0 ms  ...  1.3x faster
max   baseline  64.4 ms  optimised 45.8 ms  ...  1.4x faster
sum   baseline  64.6 ms  optimised  2.4 ms  ... 27.3x faster
var   baseline 173.4 ms  optimised  7.7 ms  ... 22.4x faster
prod  baseline  63.5 ms  optimised 50.2 ms  ...  1.3x faster
any   baseline  75.5 ms  optimised  7.0 ms  ... 10.8x faster
mean  baseline 100.2 ms  optimised  3.7 ms  ... 26.9x faster
Issue Analytics
- Created 8 years ago
- Comments:16 (10 by maintainers)
Top GitHub Comments
Another hint: if you use numpy.add.at, a much faster alternative is numpy.bincount with its optional weights argument.

Just for reference, fastfunc (a small pybind11 project of mine) speeds things up by a factor of about 40.