Clustering scores containing 0 fails filtering

I compute the pairwise scores for some data and pass these scores to clustering. If my scores contain any 0s and connected_components has to filter an oversized component, then we go into an infinite loop and get stuck.

Here are the logs I get:

matching done, begin clustering
/Users/nickcrews/Library/Application Support/hatch/env/virtual/noatak-UM6-FHel/noatak/lib/python3.9/site-packages/dedupe/clustering.py:82: RuntimeWarning: divide by zero encountered in log
  min_score_logit = numpy.log(min_score) - numpy.log(1 - min_score)
A component contained 217007 elements. Components larger than 30000 are re-filtered. The threshold for this filtering is 0.0
A component contained 217007 elements. Components larger than 30000 are re-filtered. The threshold for this filtering is 0.0
A component contained 217007 elements. Components larger than 30000 are re-filtered. The threshold for this filtering is 0.0

I wonder why this hasn’t been an issue before? Perhaps the old rlr implementation wouldn’t ever return a probability of 0, but now that we are using sklearn’s implementation of RegularizedLogisticRegression, it can return 0? Otherwise, is this the symptom of some other bug?

I believe this happens in the code linked below, but I might be wrong: if min_score is 0, then numpy.log(min_score) is -inf, so min_score_logit is also -inf and numpy.exp(-min_score_logit - 1) is inf. The threshold therefore ends up as 0, so we don’t actually filter out any edges, and the same component comes back for re-filtering every time.
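To make the arithmetic concrete, here is a minimal sketch of the computation as I understand it. It assumes the threshold is the inverse logit, 1 / (1 + numpy.exp(-min_score_logit - 1)), which is my reading of the linked code rather than a verbatim copy. raised_threshold is a hypothetical helper name, not something in dedupe.

```python
import numpy


def raised_threshold(min_score):
    """Shift min_score up by one unit on the logit scale (assumed formula)."""
    # Silence the "divide by zero encountered in log" warning seen in the logs.
    with numpy.errstate(divide="ignore"):
        min_score_logit = numpy.log(min_score) - numpy.log(1 - min_score)
    return 1 / (1 + numpy.exp(-min_score_logit - 1))


# A positive minimum score gives a threshold strictly above it.
print(raised_threshold(0.2))

# A minimum score of 0 gives logit = -inf, exp(...) = inf, threshold = 0.0.
print(raised_threshold(0.0))
```

With a threshold of 0.0, a comparison like scores >= threshold keeps every edge, which would explain why the logged component size never shrinks.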

https://github.com/dedupeio/dedupe/blob/4116361854e5894f59655beb9905f60c2a0814a3/dedupe/clustering.py#L79-L97

I think the solution could be either of these:

1. Pass side="right" to np.searchsorted, so that we are effectively doing edges[scores > threshold] instead of the current edges[scores >= threshold].
2. If min_score is 0, set it to some epsilon like 1e-10.
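Here is a rough sketch of how the two options behave on toy data. The scores array, EPSILON, and the threshold formula are assumptions for illustration. This is not the code in clustering.py.

```python
import numpy

# Toy scores sorted ascending; the two 0.0 entries stand in for
# zero-probability edges.
scores = numpy.array([0.0, 0.0, 0.3, 0.7])
threshold = 0.0  # what the threshold collapses to when min_score is 0

# Option 1: side="left" (the default) keeps scores >= threshold,
# while side="right" keeps only scores > threshold.
kept_left = scores[numpy.searchsorted(scores, threshold, side="left"):]
kept_right = scores[numpy.searchsorted(scores, threshold, side="right"):]

# Option 2: clamp min_score away from 0 before taking the logit.
EPSILON = 1e-10  # hypothetical value, as suggested above
min_score = max(float(scores.min()), EPSILON)
min_score_logit = numpy.log(min_score) - numpy.log(1 - min_score)
clamped_threshold = 1 / (1 + numpy.exp(-min_score_logit - 1))  # assumed formula
kept_clamped = scores[scores >= clamped_threshold]

print(kept_left)     # still contains the zero scores
print(kept_right)    # zero scores dropped
print(kept_clamped)  # zero scores dropped
```

Either way the zero-score edges are removed on the first pass, so the component can actually get smaller.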

I just monkey patched in option 2 and it seems to be working for my issue.

EDIT: Actually, any edge with a probability of 0 should just be ignored from the get-go, regardless of whether or not we’re doing filtering, and perhaps should even be dealt with during the score() step, before cluster().
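As a sketch of that idea, zero-score edges could be dropped with a boolean mask before the scores ever reach clustering. This assumes the scores are a NumPy structured array with "pairs" and "score" fields. drop_zero_scores is a hypothetical helper, not part of dedupe.

```python
import numpy


def drop_zero_scores(scores):
    """Return only the rows whose score is strictly positive."""
    return scores[scores["score"] > 0]


# Toy structured array in the assumed (pairs, score) layout.
scores = numpy.array(
    [((0, 1), 0.9), ((1, 2), 0.0), ((2, 3), 0.4)],
    dtype=[("pairs", "i8", 2), ("score", "f4")],
)

nonzero = drop_zero_scores(scores)
print(nonzero["pairs"])  # the (1, 2) pair is gone
```

After this, min_score inside any component is strictly positive, so its logit is finite.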

@fgregg