Improve PatternRecognizer to respect multiple match groups

PatternRecognizer treats a pattern containing several match groups as a single match, which is not desirable behaviour. It would be preferable for each match group to be reported as its own match.

Example:

from presidio_analyzer import PatternRecognizer, Pattern

text = 'this is an example string with many words'

patterns = [
  # regex with a non-capturing group
  Pattern('Non match group', r'\b(is)(?: an)', 0.9),
  # regex with multiple match groups
  Pattern('Multiple match groups', r'(string).*?(words)', 0.9),
]

pr_wrong = PatternRecognizer(supported_entity='X', patterns=patterns)
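# MultiPatternRecognizer is not imported above; it stands for a
# PatternRecognizer with the change proposed below applied
pr_right = MultiPatternRecognizer(supported_entity='X', patterns=patterns)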

print('Wrong')
for m in pr_wrong.analyze(text, entities=['X']):
  t = text[m.start: m.end]
  print(m, f'- "{t}"')
  
print('\nCorrect')
for m in pr_right.analyze(text, entities=['X']):
  t = text[m.start: m.end]
  print(m, f'- "{t}"')

Output:

Wrong
type: X, start: 5, end: 10, score: 0.9 - "is an"
type: X, start: 19, end: 41, score: 0.9 - "string with many words"

Correct
type: X, start: 5, end: 7, score: 0.9 - "is"
type: X, start: 19, end: 25, score: 0.9 - "string"
type: X, start: 36, end: 41, score: 0.9 - "words"

A simple change to PatternRecognizer.__analyze_patterns, something like the following, should work.

From:

for match in matches:
    start, end = match.span()
    current_match = text[start:end]

To:

for match in matches:
    # Loop through each match group if more than one
    offset = 0 if len(match.regs) == 1 else 1
    for start, end in match.regs[offset:]:
        current_match = text[start:end]
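
The proposed loop can be tried outside Presidio with only the standard `re` module. The following is a minimal sketch, not Presidio code: the helper name `group_spans` is hypothetical, and it uses the documented `Match.span(group)` call instead of the `match.regs` attribute from the snippet above. It also skips groups that did not take part in a match, a case the snippet above does not handle.

```python
import re


def group_spans(pattern, text):
    """Return one (start, end) span per capture group that matched.

    If the pattern has no capture groups, fall back to the span of the
    whole match, which mirrors the legacy behaviour.
    """
    spans = []
    for match in re.finditer(pattern, text):
        group_count = match.re.groups
        if group_count == 0:
            spans.append(match.span())
            continue
        for index in range(1, group_count + 1):
            start, end = match.span(index)
            if start == -1:
                # An optional group that did not participate reports (-1, -1)
                continue
            spans.append((start, end))
    return spans


text = 'this is an example string with many words'
for pattern in (r'\b(is)(?: an)', r'(string).*?(words)'):
    for start, end in group_spans(pattern, text):
        print(start, end, repr(text[start:end]))
```

For the two patterns in the example this yields the spans (5, 7), (19, 25) and (36, 41), the same positions listed under "Correct" above.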

Issue Analytics

  • State: open
  • Created 2 years ago
  • Reactions: 1
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

2 reactions
SharonHart commented, Jul 21, 2021

@omri374 I think that ‘an’ is encapsulated in a non-capturing group, (?: an). In that case the regex still has to match ‘an’, but it is excluded from the captured groups and therefore from the results.

0 reactions
mprojlb commented, Jul 21, 2021

Pretty much. Worth noting that with this change (string.*?words) would return the legacy result of one match of “string with many words” and (string).*?(words) would return two matches of “string” and “words”.
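
The difference between the two patterns can be checked with the loop from the top post, applied directly to `re` matches. This is only a sketch: the wrapper name `matches_from_regs` is hypothetical, and `match.regs` is an undocumented attribute of CPython's match objects, so relying on it is an assumption carried over from the proposal.

```python
import re


def matches_from_regs(pattern, text):
    """Apply the proposed loop and return the matched substrings."""
    results = []
    for match in re.finditer(pattern, text):
        # regs[0] is the whole match; any capture groups follow it
        offset = 0 if len(match.regs) == 1 else 1
        for start, end in match.regs[offset:]:
            results.append(text[start:end])
    return results


text = 'this is an example string with many words'
one_group = matches_from_regs(r'(string.*?words)', text)
two_groups = matches_from_regs(r'(string).*?(words)', text)
print(one_group)
print(two_groups)
```

With a single group around the whole expression the result is the one legacy match, while two separate groups give two results.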

My apologies, but I’m afraid I’m not interested in the Microsoft contribution guidelines and process. However, the code in the top post works as described and is sufficient for someone to make the change (great first issue).
