Presidio with Estonian
Describe the bug
I am trying to use Presidio with the Estonian language. Estonian is supported by Stanza (see this link). However, I get this error:
UnsupportedProcessorError: Processor ner is not known for language et. If you have created your own model, please specify the ner_model_path parameter when creating the pipeline.
Stanza's Estonian package does not include an NER model, but I don't know how to change the processors that Presidio requests.
To Reproduce
This is my code:
import stanza
from presidio_analyzer import AnalyzerEngine, RecognizerRegistry
from presidio_analyzer.nlp_engine import NlpEngineProvider
# stanza.download("et") # One time download
configuration = { "nlp_engine_name": "stanza", "models": [{"lang_code": "et", "model_name": "et"}] }
provider = NlpEngineProvider(nlp_configuration=configuration)
nlp_engine_with_stanza = provider.create_engine()
analyzer = AnalyzerEngine(nlp_engine=nlp_engine_with_stanza, supported_languages=["et"])
# Analyze
results = analyzer.analyze(text="Minu nimi on Raphael", language="et", return_decision_process=True)
print(results)
print(results[0].analysis_explanation)
Additional context
Here is the full error log:
2022-02-06 12:58:19 INFO: Loading these models for language: et (Estonian):
=======================
| Processor | Package |
-----------------------
| tokenize | edt |
| pos | edt |
| lemma | edt |
| ner | default |
=======================
2022-02-06 12:58:19 INFO: Use device: gpu
2022-02-06 12:58:19 INFO: Loading: tokenize
2022-02-06 12:58:19 INFO: Loading: pos
2022-02-06 12:58:20 INFO: Loading: lemma
2022-02-06 12:58:20 INFO: Loading: ner
2022-02-06 12:58:20 ERROR: Cannot load model from /data/home/hassan.eldeeb/stanza_resources/et/ner/default.pt
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
~/.conda/envs/text_anonymization/lib/python3.8/site-packages/stanza/pipeline/core.py in __init__(self, lang, dir, package, processors, logging_level, verbose, use_gpu, model_dir, **kwargs)
140 # try to build processor, throw an exception if there is a requirements issue
--> 141 self.processors[processor_name] = NAME_TO_PROCESSOR_CLASS[processor_name](config=curr_processor_config,
142 pipeline=self,
~/.conda/envs/text_anonymization/lib/python3.8/site-packages/stanza/pipeline/processor.py in __init__(self, config, pipeline, use_gpu)
158 if not hasattr(self, '_variant'):
--> 159 self._set_up_model(config, use_gpu)
160
~/.conda/envs/text_anonymization/lib/python3.8/site-packages/stanza/pipeline/ner_processor.py in _set_up_model(self, config, use_gpu)
26 'charlm_backward_file': config.get('backward_charlm_path', None)}
---> 27 self._trainer = Trainer(args=args, model_file=config['model_path'], use_cuda=use_gpu)
28
~/.conda/envs/text_anonymization/lib/python3.8/site-packages/stanza/models/ner/trainer.py in __init__(self, args, vocab, pretrain, model_file, use_cuda, train_classifier_only)
51 # load everything from file
---> 52 self.load(model_file, args)
53 else:
~/.conda/envs/text_anonymization/lib/python3.8/site-packages/stanza/models/ner/trainer.py in load(self, filename, args)
136 try:
--> 137 checkpoint = torch.load(filename, lambda storage, loc: storage)
138 except BaseException:
~/.conda/envs/text_anonymization/lib/python3.8/site-packages/torch/serialization.py in load(f, map_location, pickle_module, **pickle_load_args)
593
--> 594 with _open_file_like(f, 'rb') as opened_file:
595 if _is_zipfile(opened_file):
~/.conda/envs/text_anonymization/lib/python3.8/site-packages/torch/serialization.py in _open_file_like(name_or_buffer, mode)
229 if _is_path(name_or_buffer):
--> 230 return _open_file(name_or_buffer, mode)
231 else:
~/.conda/envs/text_anonymization/lib/python3.8/site-packages/torch/serialization.py in __init__(self, name, mode)
210 def __init__(self, name, mode):
--> 211 super(_open_file, self).__init__(open(name, mode))
212
FileNotFoundError: [Errno 2] No such file or directory: '/data/home/hassan.eldeeb/stanza_resources/et/ner/default.pt'
During handling of the above exception, another exception occurred:
UnsupportedProcessorError Traceback (most recent call last)
/tmp/ipykernel_2501/3553649706.py in <module>
9
10 provider = NlpEngineProvider(nlp_configuration=configuration)
---> 11 nlp_engine_with_stanza = provider.create_engine()
12
13 analyzer = AnalyzerEngine(nlp_engine=nlp_engine_with_stanza, supported_languages=["et"])
~/.conda/envs/text_anonymization/lib/python3.8/site-packages/presidio_analyzer/nlp_engine/nlp_engine_provider.py in create_engine(self)
79 for m in self.nlp_configuration["models"]
80 }
---> 81 engine = nlp_engine_class(nlp_engine_opts)
82 logger.info(
83 f"Created NLP engine: {engine.engine_name}. "
~/.conda/envs/text_anonymization/lib/python3.8/site-packages/presidio_analyzer/nlp_engine/stanza_nlp_engine.py in __init__(self, models)
32 logger.debug(f"Loading Stanza models: {models.values()}")
33
---> 34 self.nlp = {
35 lang_code: spacy_stanza.load_pipeline(
36 model_name,
~/.conda/envs/text_anonymization/lib/python3.8/site-packages/presidio_analyzer/nlp_engine/stanza_nlp_engine.py in <dictcomp>(.0)
33
34 self.nlp = {
---> 35 lang_code: spacy_stanza.load_pipeline(
36 model_name,
37 processors="tokenize,pos,lemma,ner",
~/.conda/envs/text_anonymization/lib/python3.8/site-packages/spacy_stanza/__init__.py in load_pipeline(name, lang, dir, package, processors, logging_level, verbose, use_gpu, **kwargs)
48 config["nlp"]["tokenizer"]["use_gpu"] = use_gpu
49 config["nlp"]["tokenizer"]["kwargs"].update(kwargs)
---> 50 return blank(name, config=config)
~/.conda/envs/text_anonymization/lib/python3.8/site-packages/spacy/__init__.py in blank(name, vocab, config, meta)
72 # We should accept both dot notation and nested dict here for consistency
73 config = util.dot_to_dict(config)
---> 74 return LangClass.from_config(config, vocab=vocab, meta=meta)
~/.conda/envs/text_anonymization/lib/python3.8/site-packages/spacy/language.py in from_config(cls, config, vocab, disable, exclude, meta, auto_fill, validate)
1747 # then we would load them twice at runtime: once when we make from config,
1748 # and then again when we load from disk.
-> 1749 nlp = lang_cls(vocab=vocab, create_tokenizer=create_tokenizer, meta=meta)
1750 if after_creation is not None:
1751 nlp = after_creation(nlp)
~/.conda/envs/text_anonymization/lib/python3.8/site-packages/spacy/language.py in __init__(self, vocab, max_length, meta, create_tokenizer, batch_size, **kwargs)
188 tokenizer_cfg = {"tokenizer": self._config["nlp"]["tokenizer"]}
189 create_tokenizer = registry.resolve(tokenizer_cfg)["tokenizer"]
--> 190 self.tokenizer = create_tokenizer(self)
191 self.batch_size = batch_size
192 self.default_error_handler = raise_error
~/.conda/envs/text_anonymization/lib/python3.8/site-packages/spacy_stanza/tokenizer.py in tokenizer_factory(nlp, lang, dir, package, processors, logging_level, verbose, use_gpu, kwargs)
35 if dir is None:
36 dir = DEFAULT_MODEL_DIR
---> 37 snlp = Pipeline(
38 lang=lang,
39 dir=dir,
~/.conda/envs/text_anonymization/lib/python3.8/site-packages/stanza/pipeline/core.py in __init__(self, lang, dir, package, processors, logging_level, verbose, use_gpu, model_dir, **kwargs)
161 if processor_name not in resources[lang]:
162 # user asked for a model which doesn't exist for this language?
--> 163 raise UnsupportedProcessorError(processor_name, lang)
164 if not os.path.exists(model_path):
165 model_name, _ = os.path.splitext(model_name)
UnsupportedProcessorError: Processor ner is not known for language et. If you have created your own model, please specify the ner_model_path parameter when creating the pipeline.
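The final frame of the traceback raises because "ner" is not listed in the resources index for "et" (the `processor_name not in resources[lang]` check in stanza/pipeline/core.py). A minimal sketch of that check in plain Python follows, useful for seeing which of the requested processors would be rejected before building the pipeline. It is not Stanza's code: `missing_processors` is a hypothetical helper, and the `resources` dict is a hand-written stand-in shaped like the index the traceback refers to, not the real downloaded file.

```python
def missing_processors(resources, lang, processors):
    """Return the requested processors that the index does not list for lang.

    resources:  dict mapping language code -> dict keyed by processor name
                (assumed shape, inferred from the traceback above)
    processors: comma-separated string, e.g. "tokenize,pos,lemma,ner"
    """
    available = resources.get(lang, {})
    requested = [p.strip() for p in processors.split(",") if p.strip()]
    return [p for p in requested if p not in available]


# Hypothetical index for illustration only; it mirrors the log above, where
# tokenize, pos and lemma load from the "edt" package and ner has no model.
resources = {"et": {"tokenize": {}, "pos": {}, "lemma": {}}}

# The processor string is the one visible in stanza_nlp_engine.py in the traceback.
missing = missing_processors(resources, "et", "tokenize,pos,lemma,ner")
print(missing)
```

If the returned list is non-empty, the pipeline will fail in the same way for that language unless a model path for the missing processor is supplied, as the error message suggests.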
Thank you for the helpful responses. They helped me.
Luckily, one of the authors of EstBERT has joined us, and she will continue to work on this task. Once we finish and our company approves publishing this work, I will let you know.
Thanks again for your valuable advice.