cdist is very slow if a custom weight vector is supplied
Is your feature request related to a problem? Please describe.
Several distance metrics in `cdist` take an optional weight vector by which to scale the input vectors, e.g. `sqeuclidean`. The issue is that supplying such a vector causes significant performance degradation in terms of computational time.
The underlying bottleneck seems to be the data validation performed on the weight vector. The function `_validate_vector` in `distance.py` is called every time the `cdist` function is invoked. When `cdist` is used in an optimization problem with potentially many iterations, `_validate_vector` will be called many thousands of times, essentially for no good reason.
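For context, the validation in question boils down to a conversion and a squeeze on every single call. Roughly, paraphrasing the scipy source (the exact code may differ between versions):

```python
import numpy as np

def _validate_vector(u, dtype=None):
    # Convert the input (possibly copying) and squash it to 1-D on every call.
    u = np.asarray(u, dtype=dtype).squeeze()
    # Ensure values such as u=1 and u=[1] still return 1-D arrays.
    u = np.atleast_1d(u)
    if u.ndim > 1:
        raise ValueError("Input vector should be 1-D.")
    return u
```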
The work-around I am currently using is to manually re-scale the input vectors and then supply them to `cdist` with the default `None` value for the weight vector. The performance increase is staggering.
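A minimal sketch of that workaround for `sqeuclidean` (the array sizes here are illustrative): since the weighted squared Euclidean distance is a sum of w_i * (u_i - v_i)**2 terms, pre-scaling both inputs by sqrt(w) once yields the same matrix without ever passing `w` to `cdist`.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
x = rng.random((1000, 50))
y = rng.random((1000, 50))
w = rng.random(50)

# Slow path: weighted metric, with the weight vector validated on every call.
d_weighted = cdist(x, y, metric='sqeuclidean', w=w)

# Fast path: pre-scale both inputs by sqrt(w) once, then use no weights at all.
sw = np.sqrt(w)
d_rescaled = cdist(x * sw, y * sw, metric='sqeuclidean')

assert np.allclose(d_weighted, d_rescaled)
```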
Describe the solution you’d like
Wouldn’t the Pythonic way be to simply use the supplied weight vector as-is and, if it is malformed, raise an appropriate exception, so that the user makes sure the right data type is supplied? As it stands, validating data types and squashing to the required dimensions is simply too time-consuming in the context of an optimization problem.
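As an illustration of that proposal (a hypothetical helper, not scipy code): trust the supplied array and perform only the cheap shape checks, failing loudly on bad input.

```python
import numpy as np

def check_weights(w, n_features):
    # Hypothetical helper: no dtype conversion, copying, or squeezing --
    # just inexpensive sanity checks that raise on misuse.
    w = np.asarray(w)  # no copy if w is already an ndarray
    if w.ndim != 1 or w.shape[0] != n_features:
        raise ValueError(f"w must be a 1-D array of length {n_features}")
    return w
```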
Describe alternatives you’ve considered
Manually re-scale the input vectors by the custom weight vector and use `cdist` with no optional arguments.
Issue Analytics
- Created 3 years ago
- Reactions: 1
- Comments: 9 (6 by maintainers)
Here is a straightforward example:
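A minimal benchmark along these lines, with assumed array sizes and the `sqeuclidean` metric (a sketch, not the original snippet):

```python
import numpy as np
from scipy.spatial.distance import cdist
from timeit import timeit

rng = np.random.default_rng(0)
x = rng.random((500, 50))
y = rng.random((500, 50))
w = np.ones(50)  # even a no-op weight vector triggers the slow path

t_plain = timeit(lambda: cdist(x, y, 'sqeuclidean'), number=20)
t_weighted = timeit(lambda: cdist(x, y, 'sqeuclidean', w=w), number=20)
print(f"no weights: {t_plain:.4f}s   with weights: {t_weighted:.4f}s")
```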
We can see that introducing weights results in a three-order-of-magnitude slow-down, exactly because it falls back to calling `test_euclidean` individually for each `x[i], y[j]` pair in the distance matrix. There is a similar performance cliff for `pdist` as well.
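Roughly, the weighted path degenerates into a pure-Python pairwise loop like the following (an illustration of the pattern, not the actual scipy internals):

```python
import numpy as np
from scipy.spatial.distance import sqeuclidean

def cdist_slow(x, y, w):
    # One Python-level metric call (including weight validation) per (i, j)
    # pair: O(len(x) * len(y)) interpreter round-trips instead of a single
    # vectorized C routine.
    out = np.empty((len(x), len(y)))
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            out[i, j] = sqeuclidean(xi, yj, w=w)
    return out
```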
I have time to work on this and am interested in basically re-writing `spatial.distance`. Ideally, each distance metric should be implemented once in C++ for both the weighted and non-weighted cases, and there would be infrastructure to generalize that to nd-arrays, `cdist`, and `pdist` automatically.

Hi @rgommers Unfortunately, it’d be quite time-consuming for me to post a minimal working code example based on my actual application. I have definitely measured at least a one-order-of-magnitude slow-down with arrays of moderate size, e.g. (20000 x 50), looping for about 1000 iterations.
I’m sure that any test with similarly sized vectors can show the performance difference with versus without custom weights.
I’d like to help out the community, but don’t have the bandwidth at the moment. Hopefully just reporting the issue helps with prioritizing it internally.