Compute pairwise_distances with a custom metric function for non-numeric data
This is a feature request: I’m working with strings and have a custom metric function that computes similarities between them. I would love to use the pairwise_distances function together with this custom metric to compute the whole distance/similarity matrix for my data. However, when the X and Y arrays are checked, I get a ValueError: could not convert string to float: 'some string'. It would be great if these checks could be made optional for custom metric functions.
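For readers who need the matrix in the meantime, the computation described above can be sketched in plain Python with no array validation at all. This is a naive illustration, not scikit-learn's implementation: the helper names `levenshtein` and `pairwise_matrix` are hypothetical, the edit distance stands in for whatever custom string metric is actually used, and the metric is assumed to be symmetric with zero distance between an item and itself.

```python
from itertools import combinations
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def pairwise_matrix(
    items: Sequence[T], metric: Callable[[T, T], float]
) -> List[List[float]]:
    """Full distance matrix, assuming a symmetric metric with metric(x, x) == 0.

    Only the upper triangle is computed; the lower triangle is mirrored.
    The items are never converted to floats, so any object type works.
    """
    n = len(items)
    matrix = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        distance = float(metric(items[i], items[j]))
        matrix[i][j] = distance
        matrix[j][i] = distance
    return matrix


words = ["kitten", "sitting", "mitten"]
distances = pairwise_matrix(words, levenshtein)
for row in distances:
    print(row)
```

A matrix built this way can be handed to estimators that accept a precomputed distance matrix, but it runs in a single process and makes n(n-1)/2 Python-level metric calls, so it is only practical for modest data sizes.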
Issue Analytics
- State:
- Created 4 years ago
- Reactions:2
- Comments:10 (8 by maintainers)
I wanted to use the function to compute a distance matrix that I was using elsewhere, i.e., outside of sklearn; I assumed the function had some nice parallelization built in or was otherwise more efficient than a naive implementation. So yes, it’s probably of limited value in conjunction with sklearn models, but even if the better solution there is to pass a precomputed distance matrix, that matrix still needs to be computed somehow. And considering that it’s probably just a matter of adding one parameter, check_input=True, and one if statement before the arrays are checked, I think it’s worth it, even if the benefit doesn’t extend to other sklearn models.

I have to add something to this topic. I am the main maintainer of scikit-fda, a project that implements functional data methods compatible with scikit-learn. In our case we do not even have arrays: our data represent functions, so we have our own objects, analogous to a 1D array of functions (although the functions share some common properties). We have also developed functional metrics, such as the Lp metrics (which use integrals instead of sums). For some of these metrics we even know how to compute the pairwise distances more efficiently than the naive implementation (for example, by multiplying by the quadrature weights and using einsum). Moreover, we want to apply some distance-based methods, such as k-NN and agglomerative clustering, almost verbatim to our objects. We currently achieve this by wrapping the estimators and using the “precomputed” distance.
In summary, it would be nice if we had support for the following things: