Different matching behavior across versions

Hi there,

I’ve been using this package for some time now (thank you for writing it, it’s very useful), but only recently upgraded my version, from 0.3.2 to 0.6.1. On the new version, I’m getting behavior I hadn’t expected and which differs from the older version.

Given two Pandas Series s1 and s2, I initialize and fit a StringGrouper object:

from string_grouper import StringGrouper

sg = StringGrouper(
    s1,
    s2,
    min_similarity=0.70,
    max_n_matches=1
).fit()

Then I grab the raw matches from sg._matches_list. (I know this isn’t the typical way to use this object, but it suits my use case.)

On 0.3.2 I get the behavior I usually rely on: the number of rows in the result (sg._matches_list.shape[0]) is equal to the number of unique master_side indices listed in the result (sg._matches_list['master_side'].nunique()). Meaning, there is exactly one row per record in the master list for which there was a sufficiently good match in the duplicates list. In my case, there are duplicates in the list of dupe_side indices, which is what I expect:

>>> sg._matches_list.shape
(193, 3)
>>> sg._matches_list['master_side'].nunique()
193
>>> sg._matches_list['dupe_side'].nunique()
127

On 0.6.1, I get the exact opposite behavior - one row per record in the duplicates list for which there was a sufficiently good match in the master list:

>>> sg._matches_list.shape
(193, 3)
>>> sg._matches_list['master_side'].nunique()
128
>>> sg._matches_list['dupe_side'].nunique()
193
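The two outputs above differ only in which index column is free of repeats. A minimal sketch of a check that makes this explicit, assuming the matches table has columns named master_side, dupe_side and similarity (only the first two names appear in this thread; the third is an assumption), and using made-up data rather than real output from sg._matches_list:

```python
import pandas as pd


def one_row_per(matches):
    """Return the index columns in which every value occurs exactly once."""
    return [
        col
        for col in ("master_side", "dupe_side")
        if matches[col].nunique() == len(matches)
    ]


# Toy stand-in for sg._matches_list (hypothetical values, not real output).
matches = pd.DataFrame({
    "master_side": [0, 1, 2, 3],
    "dupe_side": [5, 5, 6, 7],          # index 5 repeats, as on 0.3.2
    "similarity": [0.91, 0.84, 0.77, 0.95],
})

print(one_row_per(matches))
```

Running the same helper on the result from each version shows at a glance which side the one-match limit was applied to.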

Does anyone know why the behavior is so different between the two versions?

Top GitHub Comments

ParticularMiner commented, Mar 26, 2022

😆 Oh man! I’m having flashbacks to when I changed the documentation. In particular, I now remember changing the statement you just quoted:

max_n_matches: The maximum number of matching strings in master allowed per string in duplicates.

It used to read the opposite way:

max_n_matches: The maximum number of matches allowed per string in master.

So the docs were in fact changed. Oops. My bad. It wasn’t your fault. I browsed through string_grouper’s commit history to confirm this. Follow this link for the evidence (under README.md). It is also evident that I failed (not sure why!) to record this rather significant change in CHANGELOG.md.

I now remember clearly changing this behavior in order to take advantage of a performance enhancer (as I said before) for match_most_similar(), namely, by simply setting max_n_matches=1, the best match in master for each string in duplicates (as touted in the documentation for match_most_similar()) could be found in record time.

But at that time, master and duplicates were left and right operands respectively for the left join operation in the code that did the matching. That also meant match_most_similar() was not really working as it was claiming to, as its return value would contain all the strings in master rather than all the strings in duplicates (see this complaint by a user). To correct this, I swapped the positions of master and duplicates with respect to the left join operation.
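The effect of swapping the operands can be illustrated with a small dense similarity matrix. This is only a sketch of the idea, not string_grouper's actual code: the helper top1_per_row and the similarity values are hypothetical, and the matrix rows play the role of the left operand.

```python
import numpy as np


def top1_per_row(sim, min_similarity):
    """For each row, keep its best-scoring column if it clears the threshold."""
    rows = np.arange(sim.shape[0])
    best = sim.argmax(axis=1)
    keep = sim[rows, best] >= min_similarity
    return list(zip(rows[keep].tolist(), best[keep].tolist()))


# Hypothetical similarities: rows = 3 master strings, columns = 2 duplicates strings.
sim = np.array([
    [0.90, 0.10],
    [0.80, 0.20],
    [0.30, 0.75],
])

# master on the left: one (master, dupe) pair per master string.
master_left = top1_per_row(sim, 0.70)

# duplicates on the left: one (dupe, master) pair per duplicates string.
dupes_left = top1_per_row(sim.T, 0.70)

print(master_left)
print(dupes_left)
```

With master on the left, every master string gets a row and a duplicates index can repeat. With duplicates on the left, the roles flip, which matches the change reported between 0.3.2 and 0.6.1.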

I didn’t think anyone would even notice. But you did!

So the CHANGELOG needs updating after all.

probablyfine commented, Mar 25, 2022

No need to apologize! I think it’ll be a simple fix.

I think the left/right language works pretty well - a reasonable way to refer to the two sides of the sparse matrix.

As you had pointed out, the other part that I stumbled over was max_n_matches. I see now that the documentation does clearly define this term:

max_n_matches: The maximum number of matching strings in master allowed per string in duplicates.

Maybe a more use-agnostic way would be to make this configurable? Meaning, let the user decide if max_n_matches applies to the left side or the right side. This removes the need to refer to one side as the master side, which makes the whole thing a little more generalized.
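As a rough illustration of that idea, the limit could be applied after the fact to a table of candidate matches. The function limit_matches and its per parameter are hypothetical and not part of string_grouper; the column names are assumed to match those of sg._matches_list, and the data is made up.

```python
import pandas as pd


def limit_matches(matches, max_n_matches, per="dupe_side"):
    """Keep at most max_n_matches best-scoring rows for each value of `per`."""
    if per not in ("master_side", "dupe_side"):
        raise ValueError("per must be 'master_side' or 'dupe_side'")
    return (
        matches.sort_values("similarity", ascending=False)
        .groupby(per, sort=False)
        .head(max_n_matches)
        .sort_index()
    )


# Hypothetical candidate pairs, all above the similarity threshold.
candidates = pd.DataFrame({
    "master_side": [0, 0, 1, 2],
    "dupe_side": [0, 1, 0, 1],
    "similarity": [0.90, 0.72, 0.80, 0.75],
})

per_dupe = limit_matches(candidates, 1, per="dupe_side")
per_master = limit_matches(candidates, 1, per="master_side")

print(per_dupe)
print(per_master)
```

Filtering afterwards like this would not give the speed-up described earlier in the thread, since all candidate pairs have to be computed first. It only shows what a side-selectable limit would return.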

I’m not asking you to make revisions to the code or docs, just thinking out loud 😃
