Latin Lemmatizer raises an error on unknown form?

Code:

from cltk.stem.lemma import LemmaReplacer
lemmatizer = LemmaReplacer('latin')
sentence = "Commius autem sive expiato suo dolore sive magna parte amissa suorum legatos ad Antonium mittit seque et ibi futurum, ubi praescripserit, et ea facturum, quae imperarit, obsidibus firmat;"  # Caesar 8.48.8
lemmatizer.lemmatize(sentence)

Output

File "/home/thibault/dev/latin-topic-recognition/venv/lib/python3.4/site-packages/cltk/stem/lemma.py", line 72, in lemmatize
    headword = self.lemmata[token.lower()]
KeyError: 'antonium'

Expected Output and Discussion

I would not expect the lemmatizer to raise a KeyError on unknown words or forms. At worst, I would expect an option at instantiation (LemmaReplacer(raise_on_missing=False)) that outputs either the original form, a stemmed form, or “unknown” (I would obviously prefer the original form).
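
To make the requested behavior concrete, here is a minimal sketch of a dictionary lemmatizer that returns the original form when a token is unknown. It is not the CLTK implementation: the FallbackLemmatizer class and the toy lemma table are hypothetical, and only the raise_on_missing name comes from the suggestion above.

```python
class FallbackLemmatizer:
    """Toy dictionary lemmatizer (hypothetical, NOT the CLTK code)."""

    def __init__(self, lemmata, raise_on_missing=False):
        # lemmata: dict mapping a lowercased form to its headword (toy data).
        self.lemmata = lemmata
        self.raise_on_missing = raise_on_missing

    def lemmatize(self, tokens):
        lemmas = []
        for token in tokens:
            key = token.lower()
            if key in self.lemmata:
                lemmas.append(self.lemmata[key])
            elif self.raise_on_missing:
                # Strict mode: surface the missing form to the caller.
                raise KeyError(key)
            else:
                # Unknown form: keep the original token unchanged.
                lemmas.append(token)
        return lemmas


# Hypothetical two-entry lemma table, for illustration only.
toy_lemmata = {"legatos": "legatus", "mittit": "mitto"}
lemmatizer = FallbackLemmatizer(toy_lemmata)
result = lemmatizer.lemmatize(["legatos", "ad", "Antonium", "mittit"])
print(result)  # ['legatus', 'ad', 'Antonium', 'mitto']
```

With raise_on_missing=True the same call would raise KeyError on the first unknown form, which matches the behavior reported in the traceback.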

Is it possible that I am doing something wrong?

Issue Analytics

  • State: closed
  • Created 7 years ago
  • Comments: 8 (8 by maintainers)

Top GitHub Comments

1 reaction
PonteIneptique commented, Jun 22, 2016

Thanks a lot. I also think that a lemmatizer stack is a great idea, mostly because we could put a probabilistic one first (generally with fewer forms registered) and a lookup one after 😃 Gj
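
As a rough illustration of that stacking idea (a sketch under assumptions, not the code in the linked branch): each lemmatizer returns a lemma or None, the first non-None answer wins, and the original token is kept if every stage fails. All function names and data here are hypothetical, and the frequency-based stage is only a stand-in for a real probabilistic model.

```python
from collections import Counter


def make_frequency_lemmatizer(observations):
    """Stand-in for a probabilistic stage: pick the lemma seen most often for a form."""
    counts = {}
    for form, lemma in observations:
        counts.setdefault(form.lower(), Counter())[lemma] += 1

    def lemmatize(token):
        seen = counts.get(token.lower())
        return seen.most_common(1)[0][0] if seen else None

    return lemmatize


def make_lookup_lemmatizer(lemmata):
    """Plain dictionary lookup; returns None for unknown forms."""
    def lemmatize(token):
        return lemmata.get(token.lower())

    return lemmatize


def lemmatize_with_backoff(tokens, lemmatizers):
    """Try each lemmatizer in order; keep the original token if all return None."""
    results = []
    for token in tokens:
        lemma = None
        for lemmatize in lemmatizers:
            lemma = lemmatize(token)
            if lemma is not None:
                break
        results.append(lemma if lemma is not None else token)
    return results


# Hypothetical toy data: the first stage knows fewer forms than the lookup table.
observations = [("legatos", "legatus"), ("legatos", "legatus"), ("legatos", "lego")]
lookup_table = {"mittit": "mitto", "legatos": "lego"}

stack = [make_frequency_lemmatizer(observations), make_lookup_lemmatizer(lookup_table)]
stacked = lemmatize_with_backoff(["legatos", "mittit", "Antonium"], stack)
print(stacked)  # ['legatus', 'mitto', 'Antonium']
```

The order of the list decides which stage is trusted first, so "legatos" is resolved by the frequency stage even though the lookup table disagrees.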

0 reactions
diyclassics commented, Jun 20, 2016

@PonteIneptique You can see some preliminary work and testing here: https://github.com/diyclassics/cltk/tree/gsoc/cltk/lemmatize
