UnigramLemmatizer.choose_tag returns '', not None, for certain words, short-circuiting BackoffGreekLemmatizer
Describe the bug
The backoff Greek lemmatizer is supposed to use the following chain of sub-lemmatizers (ref):
DictLemmatizer
UnigramLemmatizer
RegexpLemmatizer
DictLemmatizer
IdentityLemmatizer
The SequentialBackoffLemmatizer superclass tries each sub-lemmatizer in turn, moving to the next whenever one returns None (ref). But certain words cause UnigramLemmatizer to return '', not None, which causes SequentialBackoffLemmatizer to return '' as the lemma without trying RegexpLemmatizer and the sub-lemmatizers that follow it. One such word is 'διοτρεφές'.
>>> from cltk.lemmatize.greek.backoff import BackoffGreekLemmatizer
>>> from cltk.corpus.utils.formatter import cltk_normalize
>>> word = cltk_normalize('διοτρεφές')
>>> lemmatizer = BackoffGreekLemmatizer()
>>> lemmatizer.lemmatize([word])
[('διοτρεφές', '')]
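The backoff behaviour described above can be illustrated with a minimal, self-contained sketch. It does not use CLTK or NLTK; `lemmatize_with_backoff`, `unigram_model`, and `later_dict` are hypothetical stand-ins, and the only assumption carried over from the report is that the chain advances when a sub-lemmatizer returns None.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical stand-in for a sub-lemmatizer: token -> lemma, or None for "no answer".
Lookup = Callable[[str], Optional[str]]

WORD = "διοτρεφές"
LEMMA = "διοτρεφής"


def lemmatize_with_backoff(tokens: List[str], chain: List[Lookup]) -> List[Tuple[str, Optional[str]]]:
    """Try each lookup in order and keep the first result that is not None."""
    results = []
    for token in tokens:
        lemma = None
        for lookup in chain:
            lemma = lookup(token)
            if lemma is not None:  # '' is not None, so it ends the search here
                break
        results.append((token, lemma))
    return results


# Toy data: the first lookup holds an empty lemma, a later one holds the right lemma.
unigram_model = {WORD: ""}
later_dict = {WORD: LEMMA}

chain = [unigram_model.get, later_dict.get, lambda token: token]
result = lemmatize_with_backoff([WORD, "foo"], chain)
print(result)
```

In this toy chain the empty string from the first lookup is accepted as the lemma, while 'foo' falls through to the identity step, which matches the two behaviours reported in this issue.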
By walking the chain of sub-lemmatizers manually, we find that it is UnigramLemmatizer (backoff4) that returns an empty string. If it returned None, then backoff would continue and eventually find a lemma with backoff2.
>>> lemmatizer.backoff5.choose_tag([word], 0, None)
>>> lemmatizer.backoff4.choose_tag([word], 0, None)
''
>>> lemmatizer.backoff3.choose_tag([word], 0, None)
>>> lemmatizer.backoff2.choose_tag([word], 0, None)
'διοτρεφής'
>>> lemmatizer.backoff1.choose_tag([word], 0, None)
'διοτρεφές'
>>> lemmatizer.backoff4
<UnigramLemmatizer: CLTK Sentence Training Data>
I cannot find a place in the inheritance chain of UnigramLemmatizer (through UnigramTagger, NgramTagger, and ContextTagger) that explicitly returns an empty string, so I suppose it must be happening somewhere in the model.
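If the empty string does come from the model, one possible guard is to treat an empty lemma the same as a missing entry. The sketch below assumes the trained model behaves like a plain token-to-lemma mapping; `toy_model`, `choose_tag_strict`, and `empty_entries` are hypothetical names and are not part of CLTK or NLTK.

```python
from typing import Dict, List, Optional


def choose_tag_strict(model: Dict[str, str], token: str) -> Optional[str]:
    """Return the lemma for token, or None if it is missing or empty."""
    lemma = model.get(token)
    return lemma or None  # maps both None and '' to None


def empty_entries(model: Dict[str, str]) -> List[str]:
    """List tokens whose stored lemma is the empty string."""
    return sorted(token for token, lemma in model.items() if lemma == "")


# Toy model (assumption): one entry was trained with an empty lemma.
BAD_TOKEN = "διοτρεφές"
toy_model = {BAD_TOKEN: "", "λόγου": "λόγος"}

print(choose_tag_strict(toy_model, BAD_TOKEN))
print(empty_entries(toy_model))
```

A check like `empty_entries` could also be run once over the training data to see how many tokens are affected before deciding whether to fix the data or the lookup.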
To Reproduce
Steps to reproduce the behavior:
- Install Python version 3.7.3.
- Install CLTK version 0.1.121 with greek_models_cltk at commit a68b983734d34df16fd49661f11c4ea037ab173a.
python3 -m venv venv
source venv/bin/activate
pip3 install cltk
python3 -c 'from cltk.corpus.utils.importer import CorpusImporter; CorpusImporter("greek").import_corpus("greek_models_cltk")'
- In a script or REPL, run the following code:
>>> from cltk.lemmatize.greek.backoff import BackoffGreekLemmatizer
>>> from cltk.corpus.utils.formatter import cltk_normalize
>>> word = cltk_normalize('διοτρεφές')
>>> lemmatizer = BackoffGreekLemmatizer()
>>> lemmatizer.lemmatize([word])
- See error:
[('διοτρεφές', '')]
Expected behavior
[('διοτρεφές', 'διοτρεφής')]
Desktop (please complete the following information):
Debian 10.7
Additional context
UnigramLemmatizer does not return '' for all strings. For example, giving it a string of non-Greek text falls all the way through to IdentityLemmatizer, as expected.
>>> lemmatizer.lemmatize(['foo'])
[('foo', 'foo')]
This is how we are using the lemmatizer. We are inserting our own DictLemmatizer
with corrections we have found in our corpus at the beginning of the backoff chain.
Thank you for catching this—working on it now. You raise a number of good points that I am looking into.
First…
I will also make two updates to the lemmatizer code itself…
Again, thank you for bringing these to our attention—looking forward to working with you all more on improving these tools/models.
I have merged the 3 PRs associated with this and bumped version to 1.0.12. @whoopsedesy thanks again for your patience while we worked on this one.