Different matching behavior across versions
Hi there,
I’ve been using this package for some time now (thank you for writing it, it’s very useful), but only recently upgraded my version, from 0.3.2 to 0.6.1. On the new version, I’m getting behavior I hadn’t expected and which differs from the older version.
Given two Pandas Series `s1` and `s2`, I initialize and fit a `StringGrouper` object:
sg = StringGrouper(
    s1,
    s2,
    min_similarity=0.70,
    max_n_matches=1
).fit()
Then I grab the raw matches from `sg._matches_list`. (I know this isn’t the typical way to use this object, but it suits my use case.)
On 0.3.2 I get the behavior I usually rely on: the number of rows in the result (`sg._matches_list.shape[0]`) is equal to the number of unique `master_side` indices listed in the result (`sg._matches_list['master_side'].nunique()`). Meaning, there is exactly one row per record in the `master` list for which there was a sufficiently good match in the `duplicates` list. In my case, there are duplicates in the list of `dupe_side` indices, which is what I expect:
>>> sg._matches_list.shape
(193, 3)
>>> sg._matches_list['master_side'].nunique()
193
>>> sg._matches_list['dupe_side'].nunique()
127
On 0.6.1, I get the exact opposite behavior - one row per record in the `duplicates` list for which there was a sufficiently good match in the `master` list:
>>> sg._matches_list.shape
(193, 3)
>>> sg._matches_list['master_side'].nunique()
128
>>> sg._matches_list['dupe_side'].nunique()
193
Does anyone know why the behavior is so different between the two versions?
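For anyone who wants to check which side their installed version constrains, here is a minimal sketch of the comparison I did above. It runs on a plain DataFrame shaped like the match list rather than on `string_grouper` itself; the helper name `constrained_side`, the toy data, and the `similarity` column name are my own assumptions.

```python
import pandas as pd


def constrained_side(matches: pd.DataFrame) -> str:
    """Report which index column has at most one row per value."""
    master_unique = matches["master_side"].is_unique
    dupe_unique = matches["dupe_side"].is_unique
    if master_unique and dupe_unique:
        return "both"
    if master_unique:
        return "master_side"
    if dupe_unique:
        return "dupe_side"
    return "neither"


# Toy data (made up): three master records, two of which hit the same duplicate.
old_style = pd.DataFrame({
    "master_side": [0, 1, 2],
    "dupe_side": [5, 5, 7],
    "similarity": [0.91, 0.84, 0.77],  # column name is an assumption
})
# Swapping the two index columns mimics the result shape I see on 0.6.1.
new_style = old_style.rename(
    columns={"master_side": "dupe_side", "dupe_side": "master_side"}
)

print(constrained_side(old_style))
print(constrained_side(new_style))
```

On a real match list, the same helper could be called on `sg._matches_list` directly, assuming it keeps the `master_side` and `dupe_side` columns.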
Top GitHub Comments
😆 Oh man! I’m having flashbacks to when I changed the documentation. In particular, I now remember changing the statement you just quoted:
It used to read the opposite way:
So the docs were in fact changed. Oops. My bad. It wasn’t your fault. I browsed through `string_grouper`’s commit history to confirm this. Follow this link for the evidence (under README.md). It is also evident that I failed (not sure why!) to record this rather significant change in CHANGELOG.md.
I now remember clearly changing this behavior in order to take advantage of a performance enhancer (as I said before) for `match_most_similar()`, namely, by simply setting `max_n_matches=1`, the best match in `master` for each string in `duplicates` (as touted in the documentation for `match_most_similar()`) could be found in record time.
But at that time, `master` and `duplicates` were left and right operands respectively for the left join operation in the code that did the matching. That also meant `match_most_similar()` was not really working as it was claiming to, as its return value would contain all the strings in `master` rather than all the strings in `duplicates` (see this complaint by a user). To correct this, I swapped the positions of `master` and `duplicates` with respect to the left join operation.
I didn’t think anyone would even notice. But you did!
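To make the effect of that swap concrete, here is a rough sketch using a small dense similarity matrix. It is not the actual `string_grouper` code (which works on sparse matrices); the helper `best_per_row` and the numbers are made up for illustration. Whichever side supplies the rows is the side that ends up with at most one match each when only the best match is kept.

```python
import numpy as np

# Made-up similarities: rows are master strings, columns are duplicates.
sim = np.array([
    [0.95, 0.10, 0.80],
    [0.90, 0.20, 0.75],
])
min_similarity = 0.70


def best_per_row(matrix, threshold):
    """Return (row, col, score) for each row's best column, if above threshold."""
    cols = matrix.argmax(axis=1)
    scores = matrix[np.arange(matrix.shape[0]), cols]
    return [
        (int(r), int(c), float(s))
        for r, (c, s) in enumerate(zip(cols, scores))
        if s >= threshold
    ]


# master on the row (left) side: at most one match per master string.
per_master = best_per_row(sim, min_similarity)

# duplicates on the row (left) side: at most one match per duplicate string.
# Tuples are reordered to (master, duplicate, score) for easy comparison.
per_duplicate = [(m, d, s) for d, m, s in best_per_row(sim.T, min_similarity)]

print(per_master)
print(per_duplicate)
```

In this toy case `per_master` lists every master index once and repeats a duplicate index, while `per_duplicate` lists each matched duplicate once and repeats a master index, which is the same flip reported between 0.3.2 and 0.6.1.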
So the CHANGELOG needs updating after all.
No need to apologize! I think it’ll be a simple fix.
I think the `left`/`right` language works pretty well - a reasonable way to refer to the two sides of the sparse matrix.
As you had pointed out, the other part that I stumbled with was `max_n_matches`. I see now that the documentation does clearly define this term:
Maybe a more use-agnostic way would be to make this configurable? Meaning, let the user decide if `max_n_matches` applies to the left side or the right side. This removes the need to refer to one side as the `master` side, which makes the whole thing a little more generalized.
I’m not asking you to make revisions to the code or docs, just thinking out loud 😃
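Just to illustrate what I mean by configurable, here is a hypothetical sketch that post-processes a long-form table of all matches. Nothing here exists in `string_grouper`: the function `limit_matches`, its `side` parameter, and the `left_side`/`right_side`/`similarity` column names are all invented for the example.

```python
import pandas as pd


def limit_matches(matches, max_n_matches=1, side="left"):
    """Keep the top max_n_matches rows per index on the chosen side."""
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    key = "left_side" if side == "left" else "right_side"
    # Rank by similarity first, then keep the first rows of each group.
    ranked = matches.sort_values("similarity", ascending=False, kind="stable")
    return ranked.groupby(key, sort=False).head(max_n_matches).sort_index()


# Made-up matches that all clear some similarity threshold.
all_matches = pd.DataFrame({
    "left_side": [0, 0, 1, 1],
    "right_side": [0, 2, 0, 2],
    "similarity": [0.95, 0.80, 0.90, 0.75],
})

print(limit_matches(all_matches, max_n_matches=1, side="left"))
print(limit_matches(all_matches, max_n_matches=1, side="right"))
```

With `side="left"` each left index keeps its single best match, and with `side="right"` each right index does, so the caller picks the semantics instead of the library fixing one side as `master`. Doing this after the fact needs the full match list, so it would not get the speed benefit mentioned above.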