
Presidio with Estonian

See original GitHub issue

Describe the bug
I am trying to use Presidio with the Estonian language. Estonian is supported by Stanza (see this link). However, I see this error:

UnsupportedProcessorError: Processor ner is not known for language et. If you have created your own model, please specify the ner_model_path parameter when creating the pipeline.

The Estonian package does not include an NER model, and I don't know how to change which processors are loaded.
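
For reference, Stanza itself lets you choose which processors to load, so a pipeline built only from the processors the Estonian package actually ships should load. This is only a sketch of the Stanza side (not a Presidio fix), assuming the default Estonian download:

import stanza

# stanza.download("et")  # one-time download of the Estonian models

# Build a plain Stanza pipeline with only the processors the Estonian
# package ships; leaving out "ner" avoids the UnsupportedProcessorError.
nlp = stanza.Pipeline(lang="et", processors="tokenize,pos,lemma")
doc = nlp("Minu nimi on Raphael")
print([(word.text, word.upos) for sent in doc.sentences for word in sent.words])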

To Reproduce
This is my code:

import stanza
from presidio_analyzer import AnalyzerEngine, RecognizerRegistry
from presidio_analyzer.nlp_engine import NlpEngineProvider

# stanza.download("et")  # One-time download

configuration = {
    "nlp_engine_name": "stanza",
    "models": [{"lang_code": "et", "model_name": "et"}],
}

provider = NlpEngineProvider(nlp_configuration=configuration)
nlp_engine_with_stanza = provider.create_engine()

analyzer = AnalyzerEngine(nlp_engine=nlp_engine_with_stanza)

# Analyze
results = analyzer.analyze(text="Minu nimi on Raphael", language="et", return_decision_process=True)
print(results)
print(results[0].analysis_explanation)
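
The traceback below shows that Presidio's Stanza engine loads the models through spacy_stanza with a fixed processor list ("tokenize,pos,lemma,ner"), which is where the missing Estonian NER model surfaces. For comparison only (this is not a drop-in Presidio workaround), a minimal sketch of loading the same Estonian models directly through spacy_stanza with the "ner" processor left out:

import spacy_stanza

# import stanza; stanza.download("et")  # one-time download

# Load the Estonian Stanza models via spacy_stanza, but without the
# "ner" processor that the default "et" package does not provide.
nlp = spacy_stanza.load_pipeline("et", processors="tokenize,pos,lemma")
doc = nlp("Minu nimi on Raphael")
print([(token.text, token.pos_, token.lemma_) for token in doc])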

Additional context
Here is the full error log:

2022-02-06 12:58:19 INFO: Loading these models for language: et (Estonian):
=======================
| Processor | Package |
-----------------------
| tokenize  | edt     |
| pos       | edt     |
| lemma     | edt     |
| ner       | default |
=======================

2022-02-06 12:58:19 INFO: Use device: gpu
2022-02-06 12:58:19 INFO: Loading: tokenize
2022-02-06 12:58:19 INFO: Loading: pos
2022-02-06 12:58:20 INFO: Loading: lemma
2022-02-06 12:58:20 INFO: Loading: ner
2022-02-06 12:58:20 ERROR: Cannot load model from /data/home/hassan.eldeeb/stanza_resources/et/ner/default.pt
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
~/.conda/envs/text_anonymization/lib/python3.8/site-packages/stanza/pipeline/core.py in __init__(self, lang, dir, package, processors, logging_level, verbose, use_gpu, model_dir, **kwargs)
    140                 # try to build processor, throw an exception if there is a requirements issue
--> 141                 self.processors[processor_name] = NAME_TO_PROCESSOR_CLASS[processor_name](config=curr_processor_config,
    142                                                                                           pipeline=self,

~/.conda/envs/text_anonymization/lib/python3.8/site-packages/stanza/pipeline/processor.py in __init__(self, config, pipeline, use_gpu)
    158         if not hasattr(self, '_variant'):
--> 159             self._set_up_model(config, use_gpu)
    160 

~/.conda/envs/text_anonymization/lib/python3.8/site-packages/stanza/pipeline/ner_processor.py in _set_up_model(self, config, use_gpu)
     26                 'charlm_backward_file': config.get('backward_charlm_path', None)}
---> 27         self._trainer = Trainer(args=args, model_file=config['model_path'], use_cuda=use_gpu)
     28 

~/.conda/envs/text_anonymization/lib/python3.8/site-packages/stanza/models/ner/trainer.py in __init__(self, args, vocab, pretrain, model_file, use_cuda, train_classifier_only)
     51             # load everything from file
---> 52             self.load(model_file, args)
     53         else:

~/.conda/envs/text_anonymization/lib/python3.8/site-packages/stanza/models/ner/trainer.py in load(self, filename, args)
    136         try:
--> 137             checkpoint = torch.load(filename, lambda storage, loc: storage)
    138         except BaseException:

~/.conda/envs/text_anonymization/lib/python3.8/site-packages/torch/serialization.py in load(f, map_location, pickle_module, **pickle_load_args)
    593 
--> 594     with _open_file_like(f, 'rb') as opened_file:
    595         if _is_zipfile(opened_file):

~/.conda/envs/text_anonymization/lib/python3.8/site-packages/torch/serialization.py in _open_file_like(name_or_buffer, mode)
    229     if _is_path(name_or_buffer):
--> 230         return _open_file(name_or_buffer, mode)
    231     else:

~/.conda/envs/text_anonymization/lib/python3.8/site-packages/torch/serialization.py in __init__(self, name, mode)
    210     def __init__(self, name, mode):
--> 211         super(_open_file, self).__init__(open(name, mode))
    212 

FileNotFoundError: [Errno 2] No such file or directory: '/data/home/hassan.eldeeb/stanza_resources/et/ner/default.pt'

During handling of the above exception, another exception occurred:

UnsupportedProcessorError                 Traceback (most recent call last)
/tmp/ipykernel_2501/3553649706.py in <module>
      9 
     10 provider = NlpEngineProvider(nlp_configuration=configuration)
---> 11 nlp_engine_with_stanza = provider.create_engine()
     12 
     13 analyzer = AnalyzerEngine(nlp_engine=nlp_engine_with_stanza, supported_languages=["et"])

~/.conda/envs/text_anonymization/lib/python3.8/site-packages/presidio_analyzer/nlp_engine/nlp_engine_provider.py in create_engine(self)
     79                 for m in self.nlp_configuration["models"]
     80             }
---> 81             engine = nlp_engine_class(nlp_engine_opts)
     82             logger.info(
     83                 f"Created NLP engine: {engine.engine_name}. "

~/.conda/envs/text_anonymization/lib/python3.8/site-packages/presidio_analyzer/nlp_engine/stanza_nlp_engine.py in __init__(self, models)
     32         logger.debug(f"Loading Stanza models: {models.values()}")
     33 
---> 34         self.nlp = {
     35             lang_code: spacy_stanza.load_pipeline(
     36                 model_name,

~/.conda/envs/text_anonymization/lib/python3.8/site-packages/presidio_analyzer/nlp_engine/stanza_nlp_engine.py in <dictcomp>(.0)
     33 
     34         self.nlp = {
---> 35             lang_code: spacy_stanza.load_pipeline(
     36                 model_name,
     37                 processors="tokenize,pos,lemma,ner",

~/.conda/envs/text_anonymization/lib/python3.8/site-packages/spacy_stanza/__init__.py in load_pipeline(name, lang, dir, package, processors, logging_level, verbose, use_gpu, **kwargs)
     48     config["nlp"]["tokenizer"]["use_gpu"] = use_gpu
     49     config["nlp"]["tokenizer"]["kwargs"].update(kwargs)
---> 50     return blank(name, config=config)

~/.conda/envs/text_anonymization/lib/python3.8/site-packages/spacy/__init__.py in blank(name, vocab, config, meta)
     72     # We should accept both dot notation and nested dict here for consistency
     73     config = util.dot_to_dict(config)
---> 74     return LangClass.from_config(config, vocab=vocab, meta=meta)

~/.conda/envs/text_anonymization/lib/python3.8/site-packages/spacy/language.py in from_config(cls, config, vocab, disable, exclude, meta, auto_fill, validate)
   1747         # then we would load them twice at runtime: once when we make from config,
   1748         # and then again when we load from disk.
-> 1749         nlp = lang_cls(vocab=vocab, create_tokenizer=create_tokenizer, meta=meta)
   1750         if after_creation is not None:
   1751             nlp = after_creation(nlp)

~/.conda/envs/text_anonymization/lib/python3.8/site-packages/spacy/language.py in __init__(self, vocab, max_length, meta, create_tokenizer, batch_size, **kwargs)
    188             tokenizer_cfg = {"tokenizer": self._config["nlp"]["tokenizer"]}
    189             create_tokenizer = registry.resolve(tokenizer_cfg)["tokenizer"]
--> 190         self.tokenizer = create_tokenizer(self)
    191         self.batch_size = batch_size
    192         self.default_error_handler = raise_error

~/.conda/envs/text_anonymization/lib/python3.8/site-packages/spacy_stanza/tokenizer.py in tokenizer_factory(nlp, lang, dir, package, processors, logging_level, verbose, use_gpu, kwargs)
     35         if dir is None:
     36             dir = DEFAULT_MODEL_DIR
---> 37         snlp = Pipeline(
     38             lang=lang,
     39             dir=dir,

~/.conda/envs/text_anonymization/lib/python3.8/site-packages/stanza/pipeline/core.py in __init__(self, lang, dir, package, processors, logging_level, verbose, use_gpu, model_dir, **kwargs)
    161                     if processor_name not in resources[lang]:
    162                         # user asked for a model which doesn't exist for this language?
--> 163                         raise UnsupportedProcessorError(processor_name, lang)
    164                     if not os.path.exists(model_path):
    165                         model_name, _ = os.path.splitext(model_name)

UnsupportedProcessorError: Processor ner is not known for language et.  If you have created your own model, please specify the ner_model_path parameter when creating the pipeline.
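
The error message itself suggests passing ner_model_path when creating the pipeline, which only applies if you have trained your own Estonian NER model. In plain Stanza that would look roughly like this (the model path below is a hypothetical placeholder):

import stanza

# Hypothetical: point Stanza at a custom Estonian NER model, as the error
# message suggests. The path is a placeholder, not a real published model.
nlp = stanza.Pipeline(
    lang="et",
    processors="tokenize,pos,lemma,ner",
    ner_model_path="/path/to/custom_et_ner.pt",
)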

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 9 (5 by maintainers)

Top GitHub Comments

2 reactions
Hassan-Eldeeb commented, Mar 30, 2022

Thank you for the helpful responses. They helped me.

0 reactions
Hassan-Eldeeb commented, Mar 31, 2022

Luckily, one of the authors of EstBERT has joined us, and she will continue to work on this task. Once we finish and our company approves publishing this work, I will let you know.

Thanks again for your valuable advice.
