UnigramLemmatizer.choose_tag returns '', not None, for certain words, short-circuiting BackoffGreekLemmatizer


Describe the bug

The backoff Greek lemmatizer is supposed to use the following chain of sub-lemmatizers (ref):

  1. DictLemmatizer
  2. UnigramLemmatizer
  3. RegexpLemmatizer
  4. DictLemmatizer
  5. IdentityLemmatizer

The SequentialBackoffLemmatizer superclass tries each sub-lemmatizer in turn, moving to the next whenever one returns None (ref). But certain words cause UnigramLemmatizer to return '', not None, which causes SequentialBackoffLemmatizer to return '' as the lemma, without trying RegexpLemmatizer and the other following sub-lemmatizers. One such word is 'διοτρεφές'.

>>> from cltk.lemmatize.greek.backoff import BackoffGreekLemmatizer
>>> from cltk.corpus.utils.formatter import cltk_normalize
>>> word = cltk_normalize('διοτρεφές')
>>> lemmatizer = BackoffGreekLemmatizer()
>>> lemmatizer.lemmatize([word])
[('διοτρεφές', '')]
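
The short-circuit can be reproduced without CLTK. What follows is a minimal sketch, assuming only what is described above: the backoff loop moves on when a sub-lemmatizer returns None and stops on anything else. The names `lemmatize_one`, `unigram`, `dictionary`, and `identity` are hypothetical stand-ins, not CLTK's classes.

```python
from typing import Callable, List, Optional

# Hypothetical stand-in type: a sub-lemmatizer maps a token to a lemma or None.
SubLemmatizer = Callable[[str], Optional[str]]


def lemmatize_one(token: str, chain: List[SubLemmatizer]) -> Optional[str]:
    """Try each sub-lemmatizer in order; stop at the first non-None result."""
    for sub in chain:
        lemma = sub(token)
        if lemma is not None:  # '' is not None, so a blank lemma ends the search
            return lemma
    return None


def unigram(token: str) -> Optional[str]:
    # Simulates a trained model that has learned a blank lemma for one word.
    return {"διοτρεφές": ""}.get(token)


def dictionary(token: str) -> Optional[str]:
    return {"διοτρεφές": "διοτρεφής"}.get(token)


def identity(token: str) -> Optional[str]:
    return token


chain = [unigram, dictionary, identity]
print(repr(lemmatize_one("διοτρεφές", chain)))  # blank lemma wins
print(repr(lemmatize_one("foo", chain)))        # falls through to identity
```

In this sketch the dictionary step would have produced 'διοτρεφής', but it is never consulted because the blank string counts as an answer.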

By walking the chain of sub-lemmatizers manually, we find that it is UnigramLemmatizer (backoff4) that returns an empty string. If it returned None, then backoff would continue and eventually find a lemma with backoff2.

>>> lemmatizer.backoff5.choose_tag([word], 0, None)
>>> lemmatizer.backoff4.choose_tag([word], 0, None)
''
>>> lemmatizer.backoff3.choose_tag([word], 0, None)
>>> lemmatizer.backoff2.choose_tag([word], 0, None)
'διοτρεφής'
>>> lemmatizer.backoff1.choose_tag([word], 0, None)
'διοτρεφές'
>>> lemmatizer.backoff4
<UnigramLemmatizer: CLTK Sentence Training Data>

I cannot find a place in the inheritance chain of UnigramLemmatizer (through UnigramTagger, NgramTagger, and ContextTagger) that explicitly returns an empty string, so I suppose it must be happening somewhere in the model.
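
One way a blank could come from the model rather than from the code: a unigram model is, roughly, a lookup table from each token to the lemma it was most often paired with in training. If a training pair has an empty lemma, the lookup returns '' for that token. The sketch below illustrates that assumption with made-up training pairs and a hypothetical `train_unigram` helper; it is not the NLTK or CLTK implementation.

```python
from collections import Counter, defaultdict


def train_unigram(sentences):
    """Map each token to the lemma it is most often paired with (hypothetical helper)."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for token, lemma in sentence:
            counts[token][lemma] += 1
    return {token: c.most_common(1)[0][0] for token, c in counts.items()}


# Made-up training data; the second pair has a missing lemma.
train = [[("μῆνιν", "μῆνις"), ("διοτρεφές", "")]]
model = train_unigram(train)

print(repr(model.get("διοτρεφές")))  # a learned blank, not a miss
print(repr(model.get("foo")))        # token never seen in training
```

A token the model has never seen yields None and lets backoff continue, which matches the 'foo' behavior reported under Additional context.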

To Reproduce

Steps to reproduce the behavior:

  1. Install Python version 3.7.3.
  2. Install CLTK version 0.1.121 with greek_models_cltk at commit a68b983734d34df16fd49661f11c4ea037ab173a.
    python3 -m venv venv
    source venv/bin/activate
    pip3 install cltk
    python3 -c "from cltk.corpus.utils.importer import CorpusImporter; CorpusImporter('greek').import_corpus('greek_models_cltk')"
    
  3. In a script or REPL, run the following code:
    >>> from cltk.lemmatize.greek.backoff import BackoffGreekLemmatizer
    >>> from cltk.corpus.utils.formatter import cltk_normalize
    >>> word = cltk_normalize('διοτρεφές')
    >>> lemmatizer = BackoffGreekLemmatizer()
    >>> lemmatizer.lemmatize([word])
    
  4. See error:
    [('διοτρεφές', '')]
    

Expected behavior

[('διοτρεφές', 'διοτρεφής')]

Desktop (please complete the following information):

Debian 10.7

Additional context

UnigramLemmatizer does not return '' for all strings. For example, giving it a string of non-Greek text falls all the way through to IdentityLemmatizer, as expected.

>>> lemmatizer.lemmatize(['foo'])
[('foo', 'foo')]

This is how we are using the lemmatizer. We are inserting our own DictLemmatizer with corrections we have found in our corpus at the beginning of the backoff chain.

https://github.com/sasansom/sedes/blob/85ba9e2a2b5e9fbf52655368451b6057922582a6/src/lemma.py#L206-L216

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 14 (11 by maintainers)

Top GitHub Comments

1 reaction
diyclassics commented, Feb 9, 2021

> So it’s looking like this issue may belong instead in https://github.com/cltk/grc_models_cltk. Should I copy it there, or is here the best place?

Thank you for catching this—working on it now. You raise a number of good points that I am looking into.

First…

  • I think you should copy over the issue of blank (i.e. ‘’) lemmas to grc_models_cltk. I will update the token-lemma pair sentences; for now, I will just remove pairs with missing lemmas. (I will then turn some attention to putting together an updated list of sentences and releasing a new version of this lemmatizer.)
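
As a rough illustration of the interim cleanup described above, the sketch below drops token-lemma pairs whose lemma is missing. It assumes the training data is a list of sentences, each a list of (token, lemma) tuples; `drop_blank_lemmas` is a hypothetical name, and discarding sentences that end up empty is a choice made here, not something stated in the thread.

```python
def drop_blank_lemmas(sentences):
    """Remove (token, lemma) pairs with an empty or missing lemma.

    Sentences left with no pairs are discarded (an assumption of this sketch).
    """
    cleaned = []
    for sentence in sentences:
        kept = [(token, lemma) for token, lemma in sentence if lemma]
        if kept:
            cleaned.append(kept)
    return cleaned


# Made-up example data.
sentences = [
    [("μῆνιν", "μῆνις"), ("διοτρεφές", "")],
    [("foo", None)],
]
print(drop_blank_lemmas(sentences))
```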

I will also make two updates to the lemmatizer code itself…

  1. Look into adding a check for blank (‘’) lemmas to UnigramLemmatizer
  2. Change the value in, as you refer to it, the subsetting step from 4000 (sentences, I believe—I will check) to 90% of the sents. (Also, when I have a chance to update the sentences, I can create an official train-test split in the models repo.)
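
The two proposed code changes could look something like the sketch below. Both helpers are hypothetical and written without reference to the merged PRs: `blank_to_none` shows the kind of check that would let backoff continue past a blank lemma, and `split_train_test` shows a proportional split in place of a fixed count of 4000.

```python
def blank_to_none(lemma):
    """Treat an empty or whitespace-only lemma as 'no answer' (hypothetical check)."""
    if lemma is None or not lemma.strip():
        return None
    return lemma


def split_train_test(sentences, train_fraction=0.9):
    """Split sentences by proportion rather than by a fixed count."""
    cut = int(len(sentences) * train_fraction)
    return sentences[:cut], sentences[cut:]


# Made-up data: ten placeholder sentences.
sentences = [[("token%d" % i, "lemma%d" % i)] for i in range(10)]
train, test = split_train_test(sentences)
print(len(train), len(test))
print(blank_to_none(""), blank_to_none("διοτρεφής"))
```

A proportional split keeps the training share stable as the sentence list grows, whereas a fixed count leaves a shrinking or growing test set depending on corpus size.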

Again, thank you for bringing these to our attention—looking forward to working with you all more on improving these tools/models.

0 reactions
kylepjohnson commented, Apr 30, 2021

I have merged the 3 PRs associated with this and bumped version to 1.0.12:

@whoopsedesy thanks again for your patience while we worked on this one.
