Improve PatternRecognizer to respect multiple match groups

PatternRecognizer treats a regex with several match groups as a single match, which is not desirable behaviour. It would be preferable if each match group were individually considered a match.

Example:

from presidio_analyzer import PatternRecognizer, Pattern

text = 'this is an example string with many words'

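patterns = [
  # regex with one capturing group followed by a non-capturing group
  Pattern('Non match group', r'\b(is)(?: an)', 0.9),
  # regex with multiple capturing groups
  Pattern('Multiple match groups', r'(string).*?(words)', 0.9),
]

pr_wrong = PatternRecognizer(supported_entity='X', patterns=patterns)
# MultiPatternRecognizer is not defined in this snippet; it presumably stands for
# a PatternRecognizer with the change proposed below applied.
pr_right = MultiPatternRecognizer(supported_entity='X', patterns=patterns)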

print('Wrong')
for m in pr_wrong.analyze(text, entities='X'):
  t = text[m.start: m.end]
  print(m, f'- "{t}"')
  
print('\nCorrect')
for m in pr_right.analyze(text, entities='X'):
  t = text[m.start: m.end]
  print(m, f'- "{t}"')

Output:

Wrong
type: X, start: 5, end: 10, score: 0.9 - "is an"
type: X, start: 19, end: 41, score: 0.9 - "string with many words"

Correct
type: X, start: 5, end: 7, score: 0.9 - "is"
type: X, start: 19, end: 25, score: 0.9 - "string"
type: X, start: 36, end: 41, score: 0.9 - "words"

A simple change to PatternRecognizer.__analyze_patterns along these lines should work. From:

for match in matches:
    start, end = match.span()
    current_match = text[start:end]

To:

for match in matches:
    # Loop through each match group if more than one
    offset = 0 if len(match.regs) == 1 else 1
    for start, end in match.regs[offset:]:
        current_match = text[start:end]
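
The idea behind the proposed change can be tried outside Presidio. The following is a minimal sketch using only the standard `re` module; `group_spans` is a hypothetical helper name, not part of presidio_analyzer, and it uses the documented `Match.span(index)` call rather than `match.regs`. It also skips groups that did not take part in a match, an edge case the snippet above does not handle.

```python
import re
from typing import Iterator, Tuple


def group_spans(pattern: str, text: str, flags: int = 0) -> Iterator[Tuple[int, int]]:
    """Yield one (start, end) span per capturing group, or the whole match if there are none."""
    compiled = re.compile(pattern, flags)
    for match in compiled.finditer(text):
        if compiled.groups == 0:
            # No capturing groups: fall back to the span of the whole match.
            yield match.span()
            continue
        for index in range(1, compiled.groups + 1):
            start, end = match.span(index)
            # A group that did not participate in the match reports (-1, -1).
            if start != -1:
                yield start, end


text = 'this is an example string with many words'
for pattern in (r'\b(is)(?: an)', r'(string).*?(words)'):
    for start, end in group_spans(pattern, text):
        print(start, end, repr(text[start:end]))
```

Run on the example text from the issue, this yields the spans listed under "Correct" above: (5, 7), (19, 25) and (36, 41).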

Top GitHub Comments

SharonHart commented, Jul 21, 2021

@omri374 I think that ‘an’ is enclosed in a non-capturing group, (?: an). In that case, the regex still has to match ‘ an’, but that text is excluded from the captured results.
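
The non-capturing behaviour described here can be checked with the standard `re` module alone. This is only an illustrative sketch on the issue's example text, not Presidio code:

```python
import re

text = 'this is an example string with many words'
match = re.search(r'\b(is)(?: an)', text)

whole = match.group(0)      # entire match, including the non-capturing part
captured = match.groups()   # only capturing groups are listed here
print(repr(whole), captured, match.span(0), match.span(1))
```

The whole match covers "is an" (5 to 10), while the only captured group is "is" (5 to 7).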

mprojlb commented, Jul 21, 2021

Pretty much. Worth noting that, with this change, (string.*?words) would return the legacy result of one match, “string with many words”, while (string).*?(words) would return two matches, “string” and “words”.

My apologies, but I’m afraid I’m not interested in the Microsoft contribution guidelines and process. However, the code in the top post works as described and is sufficient for someone to make the change (great first issue).
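
The one-group versus two-group behaviour mentioned in this comment can be reproduced with the loop from the top post applied to plain `re` matches. This is a sketch under two assumptions: `spans_via_regs` is a hypothetical helper name, and `match.regs` is an attribute of CPython's match objects that the `re` documentation does not describe.

```python
import re


def spans_via_regs(pattern, text):
    """Collect spans with the same offset logic as the proposed change."""
    spans = []
    for match in re.finditer(pattern, text):
        # regs[0] is the whole match; skip it when capturing groups exist.
        offset = 0 if len(match.regs) == 1 else 1
        spans.extend(match.regs[offset:])
    return spans


text = 'this is an example string with many words'
one_group = [text[s:e] for s, e in spans_via_regs(r'(string.*?words)', text)]
two_groups = [text[s:e] for s, e in spans_via_regs(r'(string).*?(words)', text)]
print(one_group)
print(two_groups)
```

A single enclosing group keeps the legacy result of one span, while two separate groups produce two spans.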
