Clustering scores containing 0 fails filtering
I compute the pairwise scores for some data and pass these scores to clustering. If my scores contain any 0s, and connected_components needs to filter an oversized component, we go into an infinite loop and get stuck.
Here are the logs I get:
matching done, begin clustering
/Users/nickcrews/Library/Application Support/hatch/env/virtual/noatak-UM6-FHel/noatak/lib/python3.9/site-packages/dedupe/clustering.py:82: RuntimeWarning: divide by zero encountered in log
min_score_logit = numpy.log(min_score) - numpy.log(1 - min_score)
A component contained 217007 elements. Components larger than 30000 are re-filtered. The threshold for this filtering is 0.0
A component contained 217007 elements. Components larger than 30000 are re-filtered. The threshold for this filtering is 0.0
A component contained 217007 elements. Components larger than 30000 are re-filtered. The threshold for this filtering is 0.0
I wonder why this hasn’t been an issue before? Perhaps the old rlr implementation wouldn’t ever return a probability of 0, but now that we are using sklearn’s implementation of RegularizedLogisticRegression, it can return 0? Otherwise, is this the symptom of some other bug?
I believe this happens here, but I might be wrong: if min_score is 0, then numpy.log(min_score) is -inf, and then min_score_logit is also -inf, and then numpy.exp(-min_score_logit - 1) is inf, so finally threshold is set to 0. Therefore we don’t actually filter out any edges.
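To make that chain of values concrete, here is a minimal sketch of the arithmetic. It assumes the threshold is computed as 1 / (1 + exp(-min_score_logit - 1)), which is consistent with the behaviour described above but is not copied from dedupe; refilter_threshold is a hypothetical name used only for illustration.

```python
import numpy


def refilter_threshold(min_score):
    """Hypothetical stand-in for the threshold computation in clustering.py.

    Assumption: the threshold is the logistic of (logit(min_score) + 1).
    """
    min_score = numpy.float64(min_score)
    # Silence the "divide by zero encountered in log" warning seen in the logs.
    with numpy.errstate(divide="ignore"):
        min_score_logit = numpy.log(min_score) - numpy.log(1 - min_score)
    return 1 / (1 + numpy.exp(-min_score_logit - 1))


# A nonzero minimum score yields a strictly higher threshold, so filtering
# removes at least the weakest edge on each pass.
nonzero_case = refilter_threshold(0.5)

# A minimum score of 0 gives logit = -inf, exp(inf) = inf, threshold = 0,
# so a ">= threshold" filter keeps every edge and the component never shrinks.
zero_case = refilter_threshold(0.0)
```

With a threshold of exactly 0, every pass sees the same component, which would explain the repeated log line.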
I think solutions could either be
- pass side="right" to np.searchsorted, so that we are effectively doing edges[scores > threshold] instead of what we are doing now, edges[scores >= threshold]
- if min_score is 0, set it to some epsilon like 1e-10
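A rough sketch of how the two options behave, under stated assumptions: the scores are sorted ascending, the filter keeps everything from the searchsorted index onward, and the threshold formula is the logistic of (logit + 1). The array and variable names are made up for illustration and are not dedupe's.

```python
import numpy

# Hypothetical sorted scores for one oversized component; two edges score 0.
scores = numpy.array([0.0, 0.0, 0.2, 0.9])
threshold = 0.0

# Option 1: compare the two sides of numpy.searchsorted.
# side="left" (the default) returns the first index with scores[i] >= threshold.
keep_from_left = numpy.searchsorted(scores, threshold, side="left")
# side="right" returns the first index with scores[i] > threshold.
keep_from_right = numpy.searchsorted(scores, threshold, side="right")

kept_left = scores[keep_from_left:]    # every edge survives
kept_right = scores[keep_from_right:]  # the zero-score edges are dropped

# Option 2: clamp min_score away from 0 before taking the logit.
EPSILON = 1e-10  # the value suggested above, not a tuned constant
min_score = max(float(scores.min()), EPSILON)
min_score_logit = numpy.log(min_score) - numpy.log(1 - min_score)
clamped_threshold = 1 / (1 + numpy.exp(-min_score_logit - 1))
```

Either way the component loses its zero-score edges on the next pass: option 1 by making the comparison strict, option 2 by producing a small but positive threshold.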
I just monkey patched in option 2 and it seems to be working for my issue.
EDIT: Actually, any edge with a probability of 0 should just be ignored from the get go, regardless of whether or not we’re doing filtering, and perhaps should even be dealt with during the score() step, before cluster().
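As a sketch of what ignoring those edges up front could look like: the snippet below assumes the scores are a NumPy structured array with a "pairs" field and a "score" field, and drop_zero_scores is a hypothetical helper, not an existing dedupe function.

```python
import numpy


def drop_zero_scores(scored_pairs):
    """Return only the edges whose score is strictly positive.

    Assumption: scored_pairs is a structured array with a "score" field.
    """
    return scored_pairs[scored_pairs["score"] > 0]


# Hypothetical scored pairs: record ids in "pairs", match probability in "score".
dtype = [("pairs", "i8", 2), ("score", "f4")]
scored_pairs = numpy.array(
    [((0, 1), 0.9), ((1, 2), 0.0), ((2, 3), 0.4)],
    dtype=dtype,
)

nonzero_pairs = drop_zero_scores(scored_pairs)
```

Filtering at this point means the clustering step never sees a minimum score of 0, so the logit stays finite.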
Issue Analytics
- Created a year ago
- Comments: 8 (8 by maintainers)
Top GitHub Comments
or rather, only part of his change (the test is in there): https://github.com/dedupeio/dedupe/commit/55fd8bf9633e09f8200cf28e492c722e8745590a
man. i don’t get how that happened. what a pain. Thanks for spelunking @NickCrews, yes let’s see a PR!