AffinityPropagation creates 3d array of cluster centers on rare occasions
Description
Just stumbled upon a rare combination of training data and preference
value that causes the model to store its cluster centers as a 3d ndarray
instead of the expected 2d one.
Steps/Code to Reproduce
import numpy as np
from sklearn.cluster.affinity_propagation_ import AffinityPropagation
train_data = np.array([[-1., 1.], [1., -1.]])
model = AffinityPropagation(preference=-10).fit(train_data)
model.cluster_centers_
yields
array([[[-1., 1.], [ 1., -1.]]]) # 3d!!
and
model.predict(train_data)
leads to
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "/Users/jsamoocha/.virtualenvs/coach/lib/python2.7/site-packages/sklearn/cluster/affinity_propagation_.py", line 324, in predict
return pairwise_distances_argmin(X, self.cluster_centers_)
File "/Users/jsamoocha/.virtualenvs/coach/lib/python2.7/site-packages/sklearn/metrics/pairwise.py", line 464, in pairwise_distances_argmin
metric_kwargs)[0]
File "/Users/jsamoocha/.virtualenvs/coach/lib/python2.7/site-packages/sklearn/metrics/pairwise.py", line 339, in pairwise_distances_argmin_min
X, Y = check_pairwise_arrays(X, Y)
File "/Users/jsamoocha/.virtualenvs/coach/lib/python2.7/site-packages/sklearn/metrics/pairwise.py", line 111, in check_pairwise_arrays
warn_on_dtype=warn_on_dtype, estimator=estimator)
File "/Users/jsamoocha/.virtualenvs/coach/lib/python2.7/site-packages/sklearn/utils/validation.py", line 405, in check_array
% (array.ndim, estimator_name))
ValueError: Found array with dim 3. check_pairwise_arrays expected <= 2.
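Until a fixed release is available, a defensive stopgap is to flatten the extra axis before calling predict. A minimal sketch, assuming the spurious dimension is the leading one as in the output above (this is my own workaround, not part of scikit-learn; it uses the public import path):

import numpy as np
from sklearn.cluster import AffinityPropagation

train_data = np.array([[-1., 1.], [1., -1.]])
model = AffinityPropagation(preference=-10).fit(train_data)

centers = model.cluster_centers_
if centers.ndim == 3:
    # drop the unexpected leading axis so predict() receives a 2d array
    model.cluster_centers_ = centers.reshape(-1, centers.shape[-1])

model.predict(train_data)  # no longer raises ValueError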
When using slightly different values for preference (e.g. 0 or -20), or slightly different training data (e.g. [[-1, 1], [1, -0.9]]), the cluster centers are stored correctly as a 2d ndarray.
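For reference, a quick sanity check of those nearby cases (continuing the session above; the commented shapes are what I would expect, not verified output):

model = AffinityPropagation(preference=-20).fit(train_data)
model.cluster_centers_.shape  # 2d, e.g. (1, 2)

model = AffinityPropagation(preference=-10).fit(np.array([[-1., 1.], [1., -0.9]]))
model.cluster_centers_.shape  # 2d again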
Expected Results
Cluster centers to be stored as a 2d ndarray, as in normal cases.
Versions
Darwin-15.6.0-x86_64-i386-64bit
Python 2.7.13 (default, Jul 18 2017, 09:16:53) [GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)]
NumPy 1.13.1
SciPy 0.19.1
Scikit-Learn 0.18.2
Top GitHub Comments
I saw the floating point issues only when dealing with the edge case above, i.e. [[-1, 1], [1, -1]] as training samples. Depending on the preference and damping params, the A and R diagonals would "converge" to e.g. [0.45, -0.45] and [-0.45, 0.45]. But then the code would somehow lead to different values of E (and K) per (small sequence of) iteration(s). This would then lead to the incidental non-convergence for particular values of preference, as I mentioned before (i.e. convergence to K=2 when preference=0, convergence to K=1 when preference<-20, but intermittent convergence to K=1 or non-convergence for preference in (-20, -9]).

The solution in the PR immediately returns cluster centers for the edge case above without running the actual algorithm, and as such avoids the rounding issues.
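For context, the early return described above can be sketched roughly as follows. This is my own illustration of the idea, assuming the degenerate input has already been detected (all off-diagonal similarities equal); it is not the literal PR diff:

import numpy as np

def degenerate_case_result(S, preference):
    # S: (n_samples, n_samples) similarity matrix whose off-diagonal
    # entries are all identical, so message passing cannot break the tie
    n_samples = S.shape[0]
    if preference >= S[0, 1]:
        # every point is its own exemplar -> n_samples singleton clusters
        return np.arange(n_samples), np.arange(n_samples)
    # otherwise a single cluster with an (arbitrarily chosen) exemplar
    return np.array([0]), np.zeros(n_samples, dtype=int)

Returning (cluster_centers_indices, labels) directly for this case sidesteps the floating point oscillation entirely, which matches the behavior described above.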
I am using OS X, so I have Homebrew's Python 3 and installed scikit-learn and numpy via pip. I do not use Homebrew's numpy install.