💫 Generic lemmatization & morphology
Simple and generic lemmatization functionality can be implemented as a look-up table. The lemmatization lists here look good: http://www.lexiconista.com/datasets/lemmatization/
To make this work, we need a simple class that behaves similarly to the existing spacy.lemmatizer.Lemmatizer class. A language subclass can then create this lookup lemmatizer in its Language.Defaults.create_lemmatizer() method.
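As a rough illustration of the lookup idea, here is a minimal sketch. The class name LookupLemmatizer, the from_lines helper, and the call signature are hypothetical and are not spaCy's actual API. It also assumes each line of a lemmatization list is a tab-separated pair with the lemma first and the inflected form second, which should be checked against the downloaded files.

```python
class LookupLemmatizer:
    """Hypothetical table-backed lemmatizer (illustrative, not spaCy's API)."""

    def __init__(self, table):
        # Maps an inflected form to its lemma.
        self.table = dict(table)

    @classmethod
    def from_lines(cls, lines, sep="\t"):
        # Assumption: each non-empty line is "<lemma><sep><form>".
        table = {}
        for line in lines:
            line = line.strip()
            if not line:
                continue
            lemma, form = line.split(sep, 1)
            # If a form is listed more than once, keep the first lemma seen.
            table.setdefault(form, lemma)
        return cls(table)

    def __call__(self, string, univ_pos=None, morphology=None):
        # The extra arguments are accepted but ignored: a plain lookup
        # table cannot use them. Unknown forms are returned unchanged.
        return [self.table.get(string, string)]


lemmatizer = LookupLemmatizer.from_lines(["go\twent", "go\tgoes", "be\twas"])
print(lemmatizer("went"))
```

A language's Defaults.create_lemmatizer() could then return an instance built from that language's list. Returning the input unchanged on a miss keeps the lemmatizer usable for words the table does not cover.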
This will give us a low-effort start on lemmatization in lots of languages, which can then be replaced by more sophisticated strategies on a case-by-case basis.
Wishlist extension
It would be very interesting to try a sequence-to-sequence model to generate the lemmas, using the existing lists as training data. The sequence-to-sequence model would then be used when the lookup fails. I think this might perform quite well, especially if a POS tag can be supplied as a feature.
To be specific, the sequences are the characters, and the problem is analogous to neural machine translation. So, example NMT architectures would be the best place to start.
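To make the framing concrete, the sketch below shows one way the data and the fallback could be arranged. It does not include a model. The function names, the <POS> prefix token, and the </s> end marker are all assumptions made for illustration, and the fallback argument stands in for whatever trained sequence-to-sequence model is eventually used.

```python
def to_char_example(form, lemma, pos=None):
    """Turn one (form, lemma) pair into character-level source/target sequences."""
    source = list(form)
    if pos is not None:
        # Assumption: the POS tag is supplied as an extra leading token.
        source = ["<" + pos + ">"] + source
    # Assumption: "</s>" marks the end of the target sequence.
    target = list(lemma) + ["</s>"]
    return source, target


def lemmatize(form, table, fallback, pos=None):
    """Use the lookup table first; call the fallback only when the lookup fails."""
    if form in table:
        return table[form]
    return fallback(form, pos)


source, target = to_char_example("went", "go", pos="VERB")
print(source, target)

# A placeholder fallback that returns the form unchanged, standing in
# for a trained model.
print(lemmatize("ran", {"went": "go"}, lambda form, pos: form))
```

Keeping the fallback as a plain callable means the lookup path stays unchanged whichever model architecture is tried.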
Details
- Difficulty: easy / suitable for new contributors. The extension is more complicated and requires some ML experience.
- Files to change: lemmatizer.py, language.py, possibly morphology.pyx
- Tests: tests/lemmatizer (not created yet)
- Related issues: #390
Top GitHub Comments
Already started to implement!