Improve PatternRecognizer to respect multiple match groups
PatternRecognizer treats complex match groups as one match group, which is not desirable behaviour. It would be preferable if each match group were considered a match in its own right.
Example:
from presidio_analyzer import PatternRecognizer, Pattern

text = 'this is an example string with many words'
patterns = [
    # regex with a non-capturing group
    Pattern('Non match group', r'\b(is)(?: an)', 0.9),
    # regex with multiple capturing groups
    Pattern('Multiple match groups', r'(string).*?(words)', 0.9),
]
pr_wrong = PatternRecognizer(supported_entity='X', patterns=patterns)
# MultiPatternRecognizer is not imported here; it presumably stands for a
# PatternRecognizer with the change proposed below applied.
pr_right = MultiPatternRecognizer(supported_entity='X', patterns=patterns)

print('Wrong')
for m in pr_wrong.analyze(text, entities=['X']):
    t = text[m.start:m.end]
    print(m, f'- "{t}"')

print('\nCorrect')
for m in pr_right.analyze(text, entities=['X']):
    t = text[m.start:m.end]
    print(m, f'- "{t}"')
Output:
Wrong
type: X, start: 5, end: 10, score: 0.9 - "is an"
type: X, start: 19, end: 41, score: 0.9 - "string with many words"
Correct
type: X, start: 5, end: 7, score: 0.9 - "is"
type: X, start: 19, end: 25, score: 0.9 - "string"
type: X, start: 36, end: 41, score: 0.9 - "words"
A simple change to PatternRecognizer.__analyze_patterns, something like this, should work:
From:
for match in matches:
    start, end = match.span()
    current_match = text[start:end]
To:
for match in matches:
    # Loop through each match group if more than one
    offset = 0 if len(match.regs) == 1 else 1
    for start, end in match.regs[offset:]:
        current_match = text[start:end]
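To try the idea without touching Presidio, here is a minimal sketch of the same loop using only the standard `re` module. The helper name `group_spans` is made up for illustration, and the `(-1, -1)` guard is an extra that the proposal above does not include; it is there because an optional group that takes no part in a match reports that span.

```python
import re


def group_spans(pattern, text):
    """Return one (start, end) span per capturing group, or the whole match if there are none."""
    spans = []
    for match in re.finditer(pattern, text):
        # match.regs[0] is the whole match; later entries are the capturing groups.
        offset = 0 if len(match.regs) == 1 else 1
        for start, end in match.regs[offset:]:
            # Extra guard, not in the proposal: a group that did not
            # participate in the match reports (-1, -1).
            if start == -1:
                continue
            spans.append((start, end))
    return spans


text = 'this is an example string with many words'
for pattern in (r'\b(is)(?: an)', r'(string).*?(words)'):
    for start, end in group_spans(pattern, text):
        print(start, end, repr(text[start:end]))
```

On the example text this should print the three spans listed under "Correct" above.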
@omri374 I think that ‘an’ is encapsulated in a non-capturing group, (?: an). In that case, the regex will group on ‘an’, but it will be excluded from the results.
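A quick way to see this with plain `re` (no Presidio involved): the overall span still covers ' an', while the only capturing group covers just 'is'.

```python
import re

text = 'this is an example string with many words'
match = re.search(r'\b(is)(?: an)', text)

whole = match.span()    # span of the entire match, including ' an'
group = match.span(1)   # span of the only capturing group
print(whole, repr(text[whole[0]:whole[1]]))
print(group, repr(text[group[0]:group[1]]))
print(match.groups())   # non-capturing groups do not appear here
```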
Pretty much. Worth noting that with this change (string.*?words) would return the legacy result of one match of “string with many words”, and (string).*?(words) would return two matches of “string” and “words”.

My apologies, but I’m afraid I’m not interested in the Microsoft contribution guidelines and process. However, the code in the top post works as described and is sufficient for someone to make the change (great first issue).
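For anyone picking this up, the one-group versus two-group behaviour described above can be checked with a small sketch that applies the same offset rule to both patterns. The function name `reported` is only illustrative and uses `re` directly rather than the recognizer.

```python
import re

text = 'this is an example string with many words'


def reported(pattern):
    match = re.search(pattern, text)
    # Same rule as the proposed change: skip regs[0] when capturing groups exist.
    offset = 0 if len(match.regs) == 1 else 1
    return [text[start:end] for start, end in match.regs[offset:]]


one_group = reported(r'(string.*?words)')
two_groups = reported(r'(string).*?(words)')
print(one_group)
print(two_groups)
```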