cdist is very slow if custom weight vector is supplied

Is your feature request related to a problem? Please describe.

Several distance metrics in cdist, e.g. sqeuclidean, take an optional weight vector by which to scale the input vectors. The issue is that supplying such a vector significantly increases computation time.

The underlying bottleneck seems to be the result of the data validation done on the weight vector. The function _validate_vector in distance.py is called every time the cdist function is invoked. When cdist is used in an optimization problem with potentially many iterations, _validate_vector will be called myriads of times, essentially for no good reason.

The work-around I am currently using is to manually re-scale the input vectors and then supply them to cdist with the default None value for the weight vector. The performance increase is staggering.

Describe the solution you’d like

Wouldn’t the Pythonic way be to simply use the supplied weight vector as it is and, if it is malformed, raise an appropriate exception, so that the user makes sure the right data type is supplied? As it currently stands, validating data types and squashing to the required dimensions is just too time-consuming in the context of an optimization problem.

Describe alternatives you’ve considered

Manually re-scale the input vectors by the custom weight vector and use cdist with no optional arguments.
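
A minimal sketch of this workaround, assuming the metric is sqeuclidean (the same scaling also works for euclidean). For these metrics, weighting each squared coordinate difference by w is equivalent to multiplying both inputs by sqrt(w) beforehand. The array shapes and variable names are illustrative and not taken from the original application.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
a = rng.random((100, 10))   # illustrative data, not the poster's
b = rng.random((80, 10))
w = rng.random(10)          # non-negative weights, one per column

# sum_k w[k] * (a[i,k] - b[j,k])**2
#   == sum_k (sqrt(w[k])*a[i,k] - sqrt(w[k])*b[j,k])**2,
# so scaling the columns once lets us call cdist without passing w.
sqrt_w = np.sqrt(w)
d_scaled = cdist(a * sqrt_w, b * sqrt_w, metric="sqeuclidean")

# Reference result using the built-in weight argument.
d_weighted = cdist(a, b, metric="sqeuclidean", w=w)
```

The two matrices should agree up to floating-point error, which can be checked with np.allclose(d_scaled, d_weighted). If w stays fixed across an optimization loop, sqrt_w only needs to be computed once. The identity is specific to these metrics, so the scaling would have to be derived separately for other metrics that accept w.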

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 9 (6 by maintainers)

Top GitHub Comments

3 reactions
peterbell10 commented, Mar 4, 2021

Here is a straightforward example:

In [1]: import numpy as np
   ...: from scipy.spatial.distance import cdist
   ...: a = np.random.rand(100, 10)
   ...: w = np.random.rand(10)
   ...: %timeit cdist(a, a, metric='euclidean')
   ...: %timeit cdist(a, a, w=w, metric='euclidean')
337 µs ± 3.77 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
173 ms ± 1.45 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

We can see that introducing weights results in a slow-down of roughly 500x (173 ms vs. 337 µs), close to three orders of magnitude. This is precisely because it falls back to calling 'test_euclidean' individually for each x[i], y[j] pair in the distance matrix. There is a similar performance cliff for pdist as well.
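
To illustrate why a per-pair fallback is so costly, here is a rough sketch (not SciPy's actual internals) that fills the distance matrix with one Python-level call to scipy.spatial.distance.euclidean per (i, j) pair. The helper name pairwise_weighted_euclidean is hypothetical. For the 100 x 100 case above, this amounts to 10,000 Python function calls, each with its own overhead.

```python
import numpy as np
from scipy.spatial.distance import cdist, euclidean

def pairwise_weighted_euclidean(xa, xb, w):
    """Hypothetical helper: one Python call per (i, j) pair."""
    out = np.empty((xa.shape[0], xb.shape[0]))
    for i in range(xa.shape[0]):
        for j in range(xb.shape[0]):
            # Each call computes a single weighted distance.
            out[i, j] = euclidean(xa[i], xb[j], w=w)
    return out

rng = np.random.default_rng(0)
a = rng.random((100, 10))
w = rng.random(10)

d_loop = pairwise_weighted_euclidean(a, a, w)
d_cdist = cdist(a, a, metric="euclidean", w=w)
```

Both approaches should produce the same matrix (np.allclose(d_loop, d_cdist)). The difference is only in where the loop over pairs runs, which is what makes a single shared compiled implementation for weighted and non-weighted metrics attractive.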

I have time to work on this and am interested in basically re-writing spatial.distance. Ideally, the distance metric should be implemented once in C++ for weighted and non-weighted, and there would be infrastructure to generalize that to nd-arrays, cdist and pdist automatically.

0 reactions
pmavrodiev commented, Mar 4, 2021

> Could you share some timings, or benchmarking code? Staggering can mean 20% or 20x, would be nice to have a bit more detail.

Hi @rgommers Unfortunately, it’d be quite time-consuming for me to post a minimal working code example based off of my actual application. I have definitely measured at least one order of magnitude slow-down with vectors of moderate size, e.g. (20000 x 50), looping for about 1000 iterations.

I’m sure that any test with similarly sized vectors can show the performance difference with vs without custom weights.

> spatial.distance is a bit of a mess indeed. Two large rewrites have been attempted over the years, but none of them was completed. There’s a lot of performance to gain, as well as design consistency.

I’d like to help out the community, but don’t have the bandwidth at the moment. Hopefully just reporting issues would help to prioritize internally.
