ValueError in distance matrix with agglomerative clustering

Description

A ValueError is thrown when applying AgglomerativeClustering to textual data because the distance matrix contains infinite values.

Steps/Code to Reproduce

import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering


def main():
    dataset = fetch_20newsgroups(subset='all', shuffle=True, random_state=42,
                                 remove=('headers', 'footers', 'quotes'))
    data_samples = dataset.data
    targets = dataset.target
    categories = dataset.target_names
    # One cluster per newsgroup category.
    k = np.unique(targets).shape[0]
    tf_vectorizer = TfidfVectorizer(max_features=50000, max_df=1.0, min_df=1)
    tfs = tf_vectorizer.fit_transform(data_samples)
    agg = AgglomerativeClustering(linkage="complete", n_clusters=k, affinity="cosine")
    # This call raises the ValueError shown under "Actual Results".
    agg.fit(tfs.toarray())
    return dataset


if __name__ == '__main__':
    main()

Expected Results

No error is thrown, and the distance matrix does not contain infinite values.

Actual Results

File "/venv/lib/python3.5/site-packages/sklearn/cluster/hierarchical.py", line 750, in fit
    **kwargs)
  File "/venv/lib/python3.5/site-packages/sklearn/externals/joblib/memory.py", line 362, in __call__
    return self.func(*args, **kwargs)
  File "/venv/lib/python3.5/site-packages/sklearn/cluster/hierarchical.py", line 527, in _complete_linkage
    return linkage_tree(*args, **kwargs)
  File "/venv/lib/python3.5/site-packages/sklearn/cluster/hierarchical.py", line 417, in linkage_tree
    out = hierarchy.linkage(X, method=linkage, metric=affinity)
  File "/venv/lib/python3.5/site-packages/scipy/cluster/hierarchy.py", line 713, in linkage
    raise ValueError("The condensed distance matrix must contain only "
ValueError: The condensed distance matrix must contain only finite values.

Versions

>>> import platform; print(platform.platform())
Linux-4.4.0-81-generic-x86_64-with-Ubuntu-16.04-xenial
>>> import sys; print("Python", sys.version)
Python 3.5.2 (default, Nov 17 2016, 17:05:23) 
[GCC 5.4.0 20160609]
>>> import numpy; print("NumPy", numpy.__version__)
NumPy 1.13.3
>>> import scipy; print("SciPy", scipy.__version__)
SciPy 1.0.0
>>> import sklearn; print("Scikit-Learn", sklearn.__version__)
Scikit-Learn 0.19.0
>>>

Comment: I have used the same code on a subset of the Reuters-21578 text data set and no error was thrown. I was not able to track down what might have caused the infinite values in the distance matrix.

Issue Analytics

State:
Created 6 years ago
Comments: 7 (3 by maintainers)

Top GitHub Comments

9 reactions

jnothman commented, Nov 6, 2017

You could get NaN cosine values if a vector has no non-zero elements. Is this possible in your case?
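To make the suggestion above concrete, here is a minimal sketch on a toy array (not the 20 newsgroups data). It assumes the cause is all-zero rows, which can plausibly appear when a post is left empty after headers, footers, and quotes are removed. The array `X` and the variable names are made up for illustration; only `pdist` and the NumPy calls are real APIs.

```python
import numpy as np
from scipy.spatial.distance import pdist

# Toy feature matrix; the middle row is all zeros, standing in for a
# document that has no terms left after preprocessing (an assumption).
X = np.array([
    [1.0, 0.0, 2.0],
    [0.0, 0.0, 0.0],
    [0.5, 1.0, 0.0],
])

# Cosine distance divides by the vector norms, so a zero-norm row can
# yield non-finite entries in the condensed distance matrix.
d = pdist(X, metric="cosine")
print("all finite before filtering:", np.isfinite(d).all())

# Keep only rows with at least one non-zero element.
nonzero_rows = np.abs(X).sum(axis=1) > 0
X_clean = X[nonzero_rows]

d_clean = pdist(X_clean, metric="cosine")
print("all finite after filtering:", np.isfinite(d_clean).all())
```

If this is indeed the cause, the same mask could be applied to the dense TF-IDF array before calling fit, keeping track of which documents were dropped so that labels can be mapped back to the original samples.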

3 reactions

selahlynch commented, Nov 13, 2018

I’ve discovered that a column of all 1’s will cause the same error. I searched for such columns with df.columns[df.nunique() == 1] and dropped them, and my problem was solved.
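The filtering step described in this comment might look like the sketch below. The DataFrame and its column names are invented for illustration, and the comment does not say which affinity was in use, so this shows only how constant columns can be found and dropped, not why doing so removed the error.

```python
import pandas as pd

# Hypothetical data: column "b" holds a single repeated value.
df = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [1, 1, 1],
    "c": [0.5, 0.1, 0.9],
})

# Columns with exactly one distinct value carry no information for clustering.
constant_cols = df.columns[df.nunique() == 1]
print("constant columns:", list(constant_cols))

# Drop them before building the distance matrix.
df_clean = df.drop(columns=constant_cols)
print("remaining columns:", list(df_clean.columns))
```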
