Clustering scores containing 0 fails filtering


I compute the pairwise scores for some data and pass these scores to clustering. If the scores contain any 0s and connected_components needs to re-filter an oversized component, we go into an infinite loop and get stuck.

Here are the logs I get:

matching done, begin clustering
/Users/nickcrews/Library/Application Support/hatch/env/virtual/noatak-UM6-FHel/noatak/lib/python3.9/site-packages/dedupe/clustering.py:82: RuntimeWarning: divide by zero encountered in log
  min_score_logit = numpy.log(min_score) - numpy.log(1 - min_score)
A component contained 217007 elements. Components larger than 30000 are re-filtered. The threshold for this filtering is 0.0
A component contained 217007 elements. Components larger than 30000 are re-filtered. The threshold for this filtering is 0.0
A component contained 217007 elements. Components larger than 30000 are re-filtered. The threshold for this filtering is 0.0

I wonder why this hasn’t been an issue before? Perhaps the old rlr implementation wouldn’t ever return a probability of 0, but now that we are using sklearn’s implementation of RegularizedLogisticRegression, it can return 0? Otherwise, is this the symptom of some other bug?

I believe this happens in the code linked below, but I might be wrong. If min_score is 0, then numpy.log(min_score) is -inf, so min_score_logit is also -inf and numpy.exp(-min_score_logit - 1) is inf. The threshold therefore ends up as 0. Since every score is >= 0, we don’t actually filter out any edges, and the same component gets re-filtered again and again.

https://github.com/dedupeio/dedupe/blob/4116361854e5894f59655beb9905f60c2a0814a3/dedupe/clustering.py#L79-L97
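To make the arithmetic concrete, here is a minimal sketch of the threshold computation as I understand it from the log line and the description above. The function name refilter_threshold is hypothetical, and the final 1 / (1 + exp(...)) step is my assumption about how the threshold is derived from the logit; it is not copied from dedupe.

```python
import numpy as np


def refilter_threshold(min_score: float) -> float:
    """Hypothetical helper: shift the minimum score up by 1 in logit space."""
    # Silence the divide-by-zero warning so the 0 case can be inspected.
    with np.errstate(divide="ignore", over="ignore"):
        min_score_logit = np.log(min_score) - np.log(1 - min_score)
        # Assumed form of the threshold: sigmoid(min_score_logit + 1).
        return float(1 / (1 + np.exp(-min_score_logit - 1)))


# min_score == 0 gives logit == -inf, exp(...) == inf, threshold == 0.
zero_case = refilter_threshold(0.0)

# A tiny positive minimum gives a threshold strictly above that minimum.
epsilon_case = refilter_threshold(1e-10)
```

With a threshold of 0, a ">= threshold" filter keeps every edge, which matches the repeated "The threshold for this filtering is 0.0" log lines.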

I think solutions could either be

  1. pass side="right" to np.searchsorted, so that we are effectively doing edges[scores > threshold] instead of the current edges[scores >= threshold]
  2. if min_score is 0, set it to some small epsilon like 1e-10 before taking the log

I just monkey patched in option 2 and it seems to be working for my issue.
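A small sketch of both options on toy data, in case it helps the discussion. The scores array and the EPSILON constant are made up for illustration, and this is not the actual dedupe code; it only shows how the side argument of np.searchsorted changes which edges survive, and how clamping keeps the logit finite.

```python
import numpy as np

# Toy, already-sorted scores; the two leading zeros stand in for the problem edges.
scores = np.array([0.0, 0.0, 0.2, 0.7, 0.9])
threshold = 0.0

# side="left" (the default) keeps scores >= threshold, so nothing is dropped.
kept_ge = scores[np.searchsorted(scores, threshold, side="left"):]

# Option 1: side="right" keeps only scores > threshold, dropping the zeros.
kept_gt = scores[np.searchsorted(scores, threshold, side="right"):]

# Option 2: clamp the minimum away from 0 before taking the logit.
EPSILON = 1e-10  # hypothetical constant, value taken from the suggestion above
min_score = max(float(scores.min()), EPSILON)
min_score_logit = np.log(min_score) - np.log(1 - min_score)  # finite now
```

Option 1 guarantees that each re-filtering pass removes at least the lowest-scoring edges, while option 2 only changes the threshold computation.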

EDIT: Actually, any edge with a probability of 0 should just be ignored from the get-go, regardless of whether or not we’re doing filtering, and perhaps should even be dealt with during the score() step, before cluster().
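Roughly what I have in mind for dropping those edges up front. This assumes the scores are a NumPy structured array with "pairs" and "score" fields; the dtype below and the drop_zero_scores name are only for illustration, not part of dedupe.

```python
import numpy as np


def drop_zero_scores(scored_pairs: np.ndarray) -> np.ndarray:
    """Hypothetical helper: keep only edges with a strictly positive score."""
    return scored_pairs[scored_pairs["score"] > 0]


# Assumed layout: a pair of record ids plus a float score per edge.
dtype = [("pairs", "i8", (2,)), ("score", "f4")]
scored = np.array(
    [((0, 1), 0.0), ((1, 2), 0.8), ((2, 3), 0.0), ((3, 4), 0.3)],
    dtype=dtype,
)

nonzero = drop_zero_scores(scored)  # only the (1, 2) and (3, 4) edges remain
```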

@fgregg

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 8 (8 by maintainers)

Top GitHub Comments

1 reaction
fgregg commented, Jul 18, 2022

or rather, only part of his change (the test is in there): https://github.com/dedupeio/dedupe/commit/55fd8bf9633e09f8200cf28e492c722e8745590a

0 reactions
fgregg commented, Aug 11, 2022

man. i don’t get how that happened. what a pain. Thanks for spelunking @NickCrews, yes let’s see a PR!
