Mismatched scores returned from AnalyzerEngine
Describe the bug
For some inputs, the results returned by the analyzer have unexpected scores of 1.0 under the score attribute, despite no context words being present in the input at all. These scores differ from the scores listed under analysis_explanation when return_decision_process=True (the latter are the expected scores).
To Reproduce
An example:
from presidio_analyzer import AnalyzerEngine
analyzer = AnalyzerEngine()
text = 'You can call my phone 907-882-3534'
results = analyzer.analyze(text, language='en', return_decision_process=True)
print(results)
> [type: UK_NHS, start: 22, end: 34, score: 1.0, type: PHONE_NUMBER, start: 22, end: 34, score: 0.75]
# UK_NHS has a score of 1.0 despite no context words in input, and default pattern score = 0.5
print([i.score for i in results])
> [1.0, 0.75]
print([i.analysis_explanation.score for i in results])
> [0.5, 0.75]
# this is the expected score for UK_NHS entity
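
To see where the two values diverge, it can help to print both score fields side by side, together with the validation result. A minimal sketch, assuming the AnalysisExplanation field names in presidio-analyzer 2.2.x (original_score, score, validation_result):

# compare the returned score against the decision-process scores
for r in results:
    exp = r.analysis_explanation
    print(r.entity_type, r.score, exp.original_score, exp.score, exp.validation_result)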
Expected behavior
Scores should match for both attributes.
Additional context
presidio-analyzer==2.2.27
presidio-anonymizer==2.2.27
@omri374 sure, I’ll create a PR to update the docs.
I agree it would be nice to have the confidence more configurable. In this scenario, since result validation immediately boosts the score to MAX_SCORE, there is no room left for context to be taken into consideration.
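
To make that mechanic concrete, here is a hypothetical minimal recognizer (ToyChecksumRecognizer and TOY_ID are illustrative names, not from the codebase) sketching the behavior described above: when PatternRecognizer.validate_result returns True, the match is lifted to MAX_SCORE (1.0); False drops it to MIN_SCORE (0.0); None keeps the pattern's own score.

from typing import Optional
from presidio_analyzer import Pattern, PatternRecognizer

class ToyChecksumRecognizer(PatternRecognizer):
    """Hypothetical recognizer with a stand-in checksum validation."""

    def __init__(self):
        pattern = Pattern(name="toy_id", regex=r"\b\d{10}\b", score=0.5)
        super().__init__(supported_entity="TOY_ID", patterns=[pattern])

    def validate_result(self, pattern_text: str) -> Optional[bool]:
        # Stand-in checksum: True boosts the result to 1.0 regardless of
        # the 0.5 pattern score, so context words can no longer raise it.
        return sum(int(c) for c in pattern_text) % 10 == 0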
Since a checksum is a pretty strong guarantee that a number is a specific entity, we assign a confidence of 1.0. Having said that, a phone number could accidentally pass the checksum for another entity. One thing we can do is make the confidence value more configurable.
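
That accidental pass is exactly what happens in the repro above: the digits of 907-882-3534 satisfy the standard NHS mod-11 check. A quick standalone sketch of that algorithm (the published NHS checksum, not necessarily Presidio's exact implementation):

def nhs_checksum_ok(number: str) -> bool:
    # Standard NHS number mod-11 check: weight the first nine digits
    # by 10 down to 2, sum them, and derive the expected check digit.
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) != 10:
        return False
    total = sum(d * w for d, w in zip(digits[:9], range(10, 1, -1)))
    check = 11 - (total % 11)
    if check == 11:
        check = 0
    if check == 10:  # 10 is never a valid check digit
        return False
    return check == digits[9]

print(nhs_checksum_ok('907-882-3534'))
> True  # the phone number happens to satisfy the UK_NHS checksum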