cdist is very slow if a custom weight vector is supplied
Is your feature request related to a problem? Please describe.
Several distance metrics in `cdist` take an optional weight vector by which to scale the input vectors, e.g. `sqeuclidean`. The issue is that supplying such a vector causes significant performance degradation in terms of computational time.
The underlying bottleneck seems to be the data validation performed on the weight vector. The function `_validate_vector` in `distance.py` is called every time the `cdist` function is invoked. When `cdist` is used in an optimization problem with potentially many iterations, `_validate_vector` will be called many thousands of times, essentially for no good reason.
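For context, the validation in question boils down to a conversion and a squeeze on every single call. Roughly, paraphrasing the scipy source (the exact code may differ between versions):

```python
import numpy as np

def _validate_vector(u, dtype=None):
    # Convert the input (possibly copying) and squash it to 1-D on every call.
    u = np.asarray(u, dtype=dtype).squeeze()
    # Ensure values such as u=1 and u=[1] still return 1-D arrays.
    u = np.atleast_1d(u)
    if u.ndim > 1:
        raise ValueError("Input vector should be 1-D.")
    return u
```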
The work-around I am currently using is to manually re-scale the input vectors and then supply them to `cdist` with the default `None` value for the weight vector. The performance increase is staggering.
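A minimal sketch of that workaround for `sqeuclidean` (the array sizes here are illustrative): since the weighted squared Euclidean distance is a sum of w_i * (u_i - v_i)**2 terms, pre-scaling both inputs by sqrt(w) once yields the same matrix without ever passing `w` to `cdist`.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
x = rng.random((1000, 50))
y = rng.random((1000, 50))
w = rng.random(50)

# Slow path: weighted metric, with the weight vector validated on every call.
d_weighted = cdist(x, y, metric='sqeuclidean', w=w)

# Fast path: pre-scale both inputs by sqrt(w) once, then use no weights at all.
sw = np.sqrt(w)
d_rescaled = cdist(x * sw, y * sw, metric='sqeuclidean')

assert np.allclose(d_weighted, d_rescaled)
```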
Describe the solution you’d like
Wouldn’t the Pythonic way be to simply use the supplied weight vector as-is and, if it is malformed, raise an appropriate exception, so that the user makes sure the right data type is supplied? As it stands, validating data types and squashing to the required dimensions is simply too time-consuming in the context of an optimization problem.
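As an illustration of that proposal (a hypothetical helper, not scipy code): trust the supplied array and perform only the cheap shape checks, failing loudly on bad input.

```python
import numpy as np

def check_weights(w, n_features):
    # Hypothetical helper: no dtype conversion, copying, or squeezing --
    # just inexpensive sanity checks that raise on misuse.
    w = np.asarray(w)  # no copy if w is already an ndarray
    if w.ndim != 1 or w.shape[0] != n_features:
        raise ValueError(f"w must be a 1-D array of length {n_features}")
    return w
```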
Describe alternatives you’ve considered
Manually re-scale the input vectors by the custom weight vector and use `cdist` with no optional arguments.
Issue Analytics
- Created 3 years ago
- Reactions: 1
- Comments: 9 (6 by maintainers)
Here is a straightforward example:
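A minimal benchmark along these lines, with assumed array sizes and the `sqeuclidean` metric (a sketch, not the original snippet):

```python
import numpy as np
from scipy.spatial.distance import cdist
from timeit import timeit

rng = np.random.default_rng(0)
x = rng.random((500, 50))
y = rng.random((500, 50))
w = np.ones(50)  # even a no-op weight vector triggers the slow path

t_plain = timeit(lambda: cdist(x, y, 'sqeuclidean'), number=20)
t_weighted = timeit(lambda: cdist(x, y, 'sqeuclidean', w=w), number=20)
print(f"no weights: {t_plain:.4f}s   with weights: {t_weighted:.4f}s")
```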
We can see that introducing weights results in a three-order-of-magnitude slow-down, exactly because it falls back to calling `test_euclidean` individually for each `x[i], y[j]` pair in the distance matrix. There is a similar performance cliff for `pdist` as well.
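Roughly, the weighted path degenerates into a pure-Python pairwise loop like the following (an illustration of the pattern, not the actual scipy internals):

```python
import numpy as np
from scipy.spatial.distance import sqeuclidean

def cdist_slow(x, y, w):
    # One Python-level metric call (including weight validation) per (i, j)
    # pair: O(len(x) * len(y)) interpreter round-trips instead of a single
    # vectorized C routine.
    out = np.empty((len(x), len(y)))
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            out[i, j] = sqeuclidean(xi, yj, w=w)
    return out
```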
I have time to work on this and am interested in basically re-writing `spatial.distance`. Ideally, each distance metric should be implemented once in C++ for both the weighted and non-weighted cases, and there would be infrastructure to generalize that to nd-arrays, `cdist`, and `pdist` automatically.

Hi @rgommers Unfortunately, it’d be quite time-consuming for me to post a minimal working code example based on my actual application. I have definitely measured at least a one-order-of-magnitude slow-down with arrays of moderate size, e.g. (20000 x 50), looping for about 1000 iterations.
I’m sure that any test with similarly sized vectors can show the performance difference with versus without custom weights.
I’d like to help out the community, but don’t have the bandwidth at the moment. Hopefully just reporting the issue helps with prioritizing it internally.