ValueError in distance matrix with agglomerative clustering

Description

A ValueError is thrown when applying AgglomerativeClustering to textual data because the distance matrix contains infinite values.

Steps/Code to Reproduce

import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering


def main():
    dataset = fetch_20newsgroups(subset='all', shuffle=True, random_state=42,
                                 remove=('headers', 'footers', 'quotes'))
    data_samples = dataset.data
    targets = dataset.target
    categories = dataset.target_names
    # One cluster per newsgroup category
    k = np.unique(targets).shape[0]
    tf_vectorizer = TfidfVectorizer(max_features=50000, max_df=1.0, min_df=1)
    tfs = tf_vectorizer.fit_transform(data_samples)
    agg = AgglomerativeClustering(linkage="complete", n_clusters=k, affinity="cosine")
    # This call raises the ValueError shown under "Actual Results"
    agg.fit(tfs.toarray())
    return dataset


if __name__ == '__main__':
    main()

Expected Results

No error should be thrown, and the distance matrix should not contain infinite values.

Actual Results

File "/venv/lib/python3.5/site-packages/sklearn/cluster/hierarchical.py", line 750, in fit
    **kwargs)
  File "/venv/lib/python3.5/site-packages/sklearn/externals/joblib/memory.py", line 362, in __call__
    return self.func(*args, **kwargs)
  File "/venv/lib/python3.5/site-packages/sklearn/cluster/hierarchical.py", line 527, in _complete_linkage
    return linkage_tree(*args, **kwargs)
  File "/venv/lib/python3.5/site-packages/sklearn/cluster/hierarchical.py", line 417, in linkage_tree
    out = hierarchy.linkage(X, method=linkage, metric=affinity)
  File "/venv/lib/python3.5/site-packages/scipy/cluster/hierarchy.py", line 713, in linkage
    raise ValueError("The condensed distance matrix must contain only "
ValueError: The condensed distance matrix must contain only finite values.

Versions

>>> import platform; print(platform.platform())
Linux-4.4.0-81-generic-x86_64-with-Ubuntu-16.04-xenial
>>> import sys; print("Python", sys.version)
Python 3.5.2 (default, Nov 17 2016, 17:05:23) 
[GCC 5.4.0 20160609]
>>> import numpy; print("NumPy", numpy.__version__)
NumPy 1.13.3
>>> import scipy; print("SciPy", scipy.__version__)
SciPy 1.0.0
>>> import sklearn; print("Scikit-Learn", sklearn.__version__)
Scikit-Learn 0.19.0

Comment: I used the same code on a subset of the Reuters-21578 text data set and no error was thrown. I was not able to track down what might have caused the infinite values in the distance matrix.

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

jnothman commented, Nov 6, 2017 (9 reactions)

You could get NaN cosine values if a vector has no non-zero elements. Is this possible in your case?
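
A minimal sketch of the situation described in this comment, written with plain NumPy rather than scikit-learn's or SciPy's internals. Cosine distance divides by the product of the two vector norms, so an all-zero row leads to 0/0, which is NaN. The helper names `cosine_distance` and `drop_zero_rows` are hypothetical, and whether empty documents are actually the cause in the reporter's data is an assumption.

```python
import numpy as np


def cosine_distance(u, v):
    """Cosine distance 1 - u.v / (|u| |v|); NaN if either vector is all zeros."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    # Silence the 0/0 warning so the NaN result can be inspected directly.
    with np.errstate(divide="ignore", invalid="ignore"):
        return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))


def drop_zero_rows(X):
    """Return the rows of X with at least one non-zero entry, plus the keep mask."""
    X = np.asarray(X)
    keep = np.any(X != 0, axis=1)
    return X[keep], keep


# Toy stand-in for a dense TF-IDF matrix (illustrative values only).
X = np.array([[1.0, 0.0, 2.0],
              [0.0, 0.0, 0.0],   # e.g. a document with no in-vocabulary terms
              [0.0, 3.0, 1.0]])

d_bad = cosine_distance(X[0], X[1])      # NaN: the second vector has zero norm
X_clean, keep = drop_zero_rows(X)
d_ok = cosine_distance(X_clean[0], X_clean[1])

print(np.isnan(d_bad), np.isfinite(d_ok), keep)
```

In the reproduction script, the same mask could be computed on `tfs.toarray()` before calling `agg.fit`. The resulting cluster labels would then refer only to the kept rows, so the mask is needed to map them back to the original documents.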

selahlynch commented, Nov 13, 2018 (3 reactions)

I’ve discovered that columns containing all 1’s will cause the same error. I searched for these with df.columns[df.nunique() == 1], dropped them, and my problem was solved.
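
The commenter's check uses pandas (`df.columns[df.nunique() == 1]`). As a rough NumPy analogue for a dense feature array, the sketch below finds and drops columns whose values never vary. The helper name `drop_constant_columns` is hypothetical, and the sketch assumes the array contains no NaN values. The comment does not explain why constant columns trigger the error, so treat this as a diagnostic step rather than a confirmed fix.

```python
import numpy as np


def drop_constant_columns(X):
    """Return X without columns that hold a single repeated value, plus the keep mask."""
    X = np.asarray(X)
    # A column is constant exactly when its maximum equals its minimum.
    keep = X.max(axis=0) != X.min(axis=0)
    return X[:, keep], keep


# Toy feature matrix: the first and last columns are all 1's.
X = np.array([[1.0, 0.5, 1.0],
              [1.0, 0.1, 1.0],
              [1.0, 0.9, 1.0]])

X_reduced, keep = drop_constant_columns(X)
print(keep, X_reduced.shape)
```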
