ValueError in distance matrix with agglomerative clustering
Description
A ValueError is thrown when applying AgglomerativeClustering to textual data because the distance matrix contains infinite values.
Steps/Code to Reproduce
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering


def main():
    dataset = fetch_20newsgroups(subset='all', shuffle=True, random_state=42,
                                 remove=('headers', 'footers', 'quotes'))
    data_samples = dataset.data
    targets = dataset.target
    categories = dataset.target_names
    k = np.unique(targets).shape[0]
    tf_vectorizer = TfidfVectorizer(max_features=50000, max_df=1.0, min_df=1)
    tfs = tf_vectorizer.fit_transform(data_samples)
    agg = AgglomerativeClustering(linkage="complete", n_clusters=k, affinity="cosine")
    agg.fit(tfs.toarray())
    return dataset


if __name__ == '__main__':
    main()
Expected Results
No error is thrown, and the distance matrix does not contain infinite values.
Actual Results
  File "/venv/lib/python3.5/site-packages/sklearn/cluster/hierarchical.py", line 750, in fit
    **kwargs)
  File "/venv/lib/python3.5/site-packages/sklearn/externals/joblib/memory.py", line 362, in __call__
    return self.func(*args, **kwargs)
  File "/venv/lib/python3.5/site-packages/sklearn/cluster/hierarchical.py", line 527, in _complete_linkage
    return linkage_tree(*args, **kwargs)
  File "/venv/lib/python3.5/site-packages/sklearn/cluster/hierarchical.py", line 417, in linkage_tree
    out = hierarchy.linkage(X, method=linkage, metric=affinity)
  File "/venv/lib/python3.5/site-packages/scipy/cluster/hierarchy.py", line 713, in linkage
    raise ValueError("The condensed distance matrix must contain only "
ValueError: The condensed distance matrix must contain only finite values.
Versions
>>> import platform; print(platform.platform())
Linux-4.4.0-81-generic-x86_64-with-Ubuntu-16.04-xenial
>>> import sys; print("Python", sys.version)
Python 3.5.2 (default, Nov 17 2016, 17:05:23)
[GCC 5.4.0 20160609]
>>> import numpy; print("NumPy", numpy.__version__)
NumPy 1.13.3
>>> import scipy; print("SciPy", scipy.__version__)
SciPy 1.0.0
>>> import sklearn; print("Scikit-Learn", sklearn.__version__)
Scikit-Learn 0.19.0
>>>
Comment
I have used the same code on a subset of the Reuters-21578 text data set and no error was thrown. I was not able to track down what might have caused the infinite values in the distance matrix.
Issue Analytics
- State:
- Created 6 years ago
- Comments:7 (3 by maintainers)
Top GitHub Comments
You could get NaN cosine values if a vector has no non-zero elements. Is this possible in your case?
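One way to check for this is sketched below. It is only an illustration on a toy matrix, not the reporter's data: the helper name nonzero_row_mask is made up for this example, and the small array stands in for the TF-IDF matrix (tfs in the script above). The assumption being tested is that some documents end up with no terms at all, for instance after headers, footers and quotes are stripped, which leaves an all-zero row whose norm is 0.

```python
import numpy as np
import scipy.sparse as sp
from scipy.spatial.distance import pdist


def nonzero_row_mask(X):
    """Boolean mask of rows that have at least one non-zero entry.

    Hypothetical helper for this example; accepts dense or sparse input.
    """
    if sp.issparse(X):
        # getnnz counts stored entries per row; explicitly stored zeros
        # (unusual for TF-IDF output) would also be counted.
        return X.getnnz(axis=1) > 0
    return np.any(np.asarray(X) != 0, axis=1)


# Toy stand-in for the TF-IDF matrix: the middle row is an "empty document".
X = np.array([[1.0, 0.0, 2.0],
              [0.0, 0.0, 0.0],
              [0.0, 3.0, 1.0]])

mask = nonzero_row_mask(X)
print("all-zero rows:", np.flatnonzero(~mask))

# Cosine distance divides by the row norms, so it is only well defined
# for the rows that survive the mask.
d = pdist(X[mask], metric="cosine")
print("finite after filtering:", np.isfinite(d).all())
```

If the mask does flag rows in the real data, one option is to fit on the filtered rows only and keep the mask around so the cluster labels can be mapped back to the original documents.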
I’ve discovered that columns containing only 1’s cause the same error. I searched for them with
df.columns[df.nunique() == 1]
and dropped them, which solved my problem.
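For reference, the search-and-drop step from this comment might look like the following on a small made-up DataFrame. The column names and values are invented for illustration; only the df.nunique() == 1 expression comes from the comment itself. Note that nunique ignores NaN by default, so a column holding one value plus missing entries is also flagged.

```python
import pandas as pd

# Made-up data: column "a" holds a single repeated value.
df = pd.DataFrame({
    "a": [1, 1, 1],
    "b": [0.2, 0.5, 0.1],
    "c": [3, 1, 2],
})

# Columns with exactly one distinct value (constant columns).
constant_cols = df.columns[df.nunique() == 1]
print("constant columns:", list(constant_cols))

# Drop them before computing distances.
df_clean = df.drop(columns=constant_cols)
print("remaining columns:", list(df_clean.columns))
```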