Mismatched scores returned from AnalyzerEngine
Describe the bug
For some inputs, the results returned by the analyzer have unexpected scores of 1.0 under the score attribute, despite no context words being present in the input at all. These scores differ from the scores listed under analysis_explanation when return_decision_process=True (the latter are the expected scores).
To Reproduce
An example:
from presidio_analyzer import AnalyzerEngine
analyzer = AnalyzerEngine()
text = 'You can call my phone 907-882-3534'
results = analyzer.analyze(text, language='en', return_decision_process=True)
print(results)
> [type: UK_NHS, start: 22, end: 34, score: 1.0, type: PHONE_NUMBER, start: 22, end: 34, score: 0.75]
# UK_NHS has a score of 1.0 despite no context words in input, and default pattern score = 0.5
print([i.score for i in results])
> [1.0, 0.75]
print([i.analysis_explanation.score for i in results])
> [0.5, 0.75]
# this is the expected score for UK_NHS entity
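
To see where the two values diverge, it can help to print both score fields side by side, together with the validation result. A minimal sketch, assuming the AnalysisExplanation field names in presidio-analyzer 2.2.x (original_score, score, validation_result):

# compare the returned score against the decision-process scores
for r in results:
    exp = r.analysis_explanation
    print(r.entity_type, r.score, exp.original_score, exp.score, exp.validation_result)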
Expected behavior
Scores should match for both attributes.
Additional context
presidio-analyzer==2.2.27
presidio-anonymizer==2.2.27
@omri374 sure, I’ll create a PR to update the docs.
I agree it would be nice to have the confidence more configurable. In this scenario, since result validation immediately boosts the score to MAX_SCORE, there is no room left for context to be taken into consideration.
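
To make that mechanic concrete, here is a hypothetical minimal recognizer (ToyChecksumRecognizer and TOY_ID are illustrative names, not from the codebase) sketching the behavior described above: when PatternRecognizer.validate_result returns True, the match is lifted to MAX_SCORE (1.0); False drops it to MIN_SCORE (0.0); None keeps the pattern's own score.

from typing import Optional
from presidio_analyzer import Pattern, PatternRecognizer

class ToyChecksumRecognizer(PatternRecognizer):
    """Hypothetical recognizer with a stand-in checksum validation."""

    def __init__(self):
        pattern = Pattern(name="toy_id", regex=r"\b\d{10}\b", score=0.5)
        super().__init__(supported_entity="TOY_ID", patterns=[pattern])

    def validate_result(self, pattern_text: str) -> Optional[bool]:
        # Stand-in checksum: True boosts the result to 1.0 regardless of
        # the 0.5 pattern score, so context words can no longer raise it.
        return sum(int(c) for c in pattern_text) % 10 == 0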
Since a checksum is a pretty strong guarantee that a number is a specific entity, we assign a confidence of 1.0. Having said that, a phone number could accidentally pass the checksum for another entity. One thing we can do is make the confidence value more configurable.
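
That accidental pass is exactly what happens in the repro above: the digits of 907-882-3534 satisfy the standard NHS mod-11 check. A quick standalone sketch of that algorithm (the published NHS checksum, not necessarily Presidio's exact implementation):

def nhs_checksum_ok(number: str) -> bool:
    # Standard NHS number mod-11 check: weight the first nine digits
    # by 10 down to 2, sum them, and derive the expected check digit.
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) != 10:
        return False
    total = sum(d * w for d, w in zip(digits[:9], range(10, 1, -1)))
    check = 11 - (total % 11)
    if check == 11:
        check = 0
    if check == 10:  # 10 is never a valid check digit
        return False
    return check == digits[9]

print(nhs_checksum_ok('907-882-3534'))
> True  # the phone number happens to satisfy the UK_NHS checksum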