Latin lemmatizer raises an error on unknown form?
Code:
from cltk.stem.lemma import LemmaReplacer
lemmatizer = LemmaReplacer('latin')
sentence = "Commius autem sive expiato suo dolore sive magna parte amissa suorum legatos ad Antonium mittit seque et ibi futurum, ubi praescripserit, et ea facturum, quae imperarit, obsidibus firmat;" # Caesar 8.48.8
lemmatizer.lemmatize(sentence)
Output
File "/home/thibault/dev/latin-topic-recognition/venv/lib/python3.4/site-packages/cltk/stem/lemma.py", line 72, in lemmatize
headword = self.lemmata[token.lower()]
KeyError: 'antonium'
Expected Output and Discussion
I do not expect the lemmatizer to raise a KeyError on unknown words or forms. At worst, I would expect an option at instantiation, e.g. LemmaReplacer(raise_on_missing=False), which would output either the original form, a stemmed form, or "unknown" (I obviously prefer the original form).
Is it possible that I am doing something wrong?
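To make the requested behaviour concrete, here is a minimal sketch of a dictionary lookup with the proposed raise_on_missing switch. The LookupLemmatizer class and the two-entry toy_lemmata table are hypothetical and only illustrate the idea; this is not the CLTK implementation, and raise_on_missing is the option proposed above, not an existing parameter.

```python
class LookupLemmatizer:
    """Hypothetical dictionary-based lemmatizer (illustration only, not CLTK)."""

    def __init__(self, lemmata, raise_on_missing=False):
        self.lemmata = lemmata                  # maps lowercased form -> headword
        self.raise_on_missing = raise_on_missing

    def lemmatize(self, tokens):
        lemmas = []
        for token in tokens:
            key = token.lower()
            if key in self.lemmata:
                lemmas.append(self.lemmata[key])
            elif self.raise_on_missing:
                # Strict mode: surface the unknown form to the caller.
                raise KeyError(key)
            else:
                # Lenient mode: fall back to the original form.
                lemmas.append(token)
        return lemmas


# Toy data for the example; a real table would be far larger.
toy_lemmata = {"legatos": "legatus", "mittit": "mitto"}

lemmatizer = LookupLemmatizer(toy_lemmata)
result = lemmatizer.lemmatize(["legatos", "ad", "Antonium", "mittit"])
print(result)  # ['legatus', 'ad', 'Antonium', 'mitto']
```

With raise_on_missing=True the same call would raise KeyError: 'ad', the first unknown form, which mirrors the behaviour reported in the traceback.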
Issue Analytics
- State:
- Created 7 years ago
- Comments:8 (8 by maintainers)
Thanks a lot. I also think that a lemmatizer stack is a great idea, mostly because we could put a probabilistic one first (generally with fewer forms registered) and a lookup one after it 😃 Gj
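A lemmatizer stack of this kind can be sketched as a chain of callables tried in order, where the first one that returns a lemma wins and the original form is kept if none does. Everything here is an assumption made for illustration: make_lookup and lemmatize_with_backoff are hypothetical names, the tables are toy data, and the "probabilistic" stage is just a small lookup stub standing in for a real model. This is not the code in the linked branch.

```python
def make_lookup(table):
    """Build a lemmatizer that returns a lemma, or None for unknown forms."""
    def lookup(token):
        return table.get(token.lower())
    return lookup


def lemmatize_with_backoff(tokens, chain):
    """Try each lemmatizer in order; keep the original form if all fail."""
    lemmas = []
    for token in tokens:
        lemma = None
        for lemmatizer in chain:
            lemma = lemmatizer(token)
            if lemma is not None:
                break  # first lemmatizer with an answer wins
        lemmas.append(lemma if lemma is not None else token)
    return lemmas


# Stand-in for a probabilistic model that knows only a few forms (toy data).
model_stub = make_lookup({"mittit": "mitto"})
# Larger dictionary lookup used as the fallback stage (toy data).
dictionary = make_lookup({"legatos": "legatus", "firmat": "firmo"})

stacked = lemmatize_with_backoff(
    ["legatos", "Antonium", "mittit", "firmat"],
    [model_stub, dictionary],
)
print(stacked)  # ['legatus', 'Antonium', 'mitto', 'firmo']
```

Because each stage only signals "unknown" by returning None, stages can be reordered or added without touching the loop, and an unknown form such as Antonium never raises.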
@PonteIneptique You can see some preliminary work and testing here: https://github.com/diyclassics/cltk/tree/gsoc/cltk/lemmatize