💫 Generic lemmatization & morphology

Simple and generic lemmatization functionality can be implemented as a look-up table. The lemmatization lists here look good: http://www.lexiconista.com/datasets/lemmatization/

To make this work, we need a simple class that behaves similarly to the existing spacy.lemmatizer.Lemmatizer class. A language subclass can then create this lookup lemmatizer in its Language.Defaults.create_lemmatizer() method.
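As a rough illustration of the idea, here is a minimal sketch of a table-based lemmatizer. The class name `LookupLemmatizer`, its `from_lines` helper, and the single-argument call signature are all hypothetical and are not part of spaCy's API; the real `spacy.lemmatizer.Lemmatizer` interface may differ. The sketch also assumes each line of the list holds a lemma and an inflected form separated by a tab, which should be checked against the actual files.

```python
from typing import Dict, Iterable


class LookupLemmatizer:
    """Hypothetical table-based lemmatizer (illustrative, not spaCy's API)."""

    def __init__(self, table: Dict[str, str]) -> None:
        # Maps an inflected form to its lemma.
        self.table = table

    @classmethod
    def from_lines(cls, lines: Iterable[str]) -> "LookupLemmatizer":
        # Assumed line format: "<lemma>\t<form>". Malformed lines are skipped.
        table: Dict[str, str] = {}
        for line in lines:
            line = line.strip()
            if "\t" not in line:
                continue
            lemma, form = line.split("\t", 1)
            # If a form is listed more than once, keep the first lemma seen.
            table.setdefault(form, lemma)
        return cls(table)

    def __call__(self, string: str) -> str:
        # Try the exact form, then the lowercased form,
        # and fall back to returning the input unchanged.
        if string in self.table:
            return self.table[string]
        return self.table.get(string.lower(), string)


lines = ["be\twas", "be\tis", "mouse\tmice"]
lemmatizer = LookupLemmatizer.from_lines(lines)
print(lemmatizer("mice"))
print(lemmatizer("Was"))
print(lemmatizer("unknownword"))
```

Returning the input unchanged on a miss keeps the behaviour predictable for languages whose tables are incomplete. A language subclass could build such an object wherever it constructs its lemmatizer.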

This will give us a low-effort start on lemmatization in many languages, which can then be replaced by more sophisticated strategies on a case-by-case basis.

Wishlist extension

It would be very interesting to try a sequence-to-sequence model to generate the lemmas, using the existing lists as training data. The sequence-to-sequence model would then be used only when the lookup fails. I think this might perform quite well, especially if a POS tag can be supplied as a feature.
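The lookup-then-fallback control flow could look roughly like the sketch below. The names `make_lemmatizer` and `strip_plural_s` are hypothetical, and `strip_plural_s` is only a toy stand-in for a trained model, used here so the example runs on its own.

```python
from typing import Callable, Dict, Optional

# A fallback takes the word form and an optional POS tag and returns a lemma.
Fallback = Callable[[str, Optional[str]], str]


def make_lemmatizer(table: Dict[str, str], fallback: Fallback):
    """Build a function that consults the table first, then the fallback."""

    def lemmatize(form: str, pos: Optional[str] = None) -> str:
        lemma = table.get(form)
        if lemma is not None:
            return lemma
        # Lookup failed: defer to the fallback (e.g. a trained model).
        return fallback(form, pos)

    return lemmatize


def strip_plural_s(form: str, pos: Optional[str]) -> str:
    # Toy stand-in for a model: strips a final "s" from nouns only.
    if pos == "NOUN" and form.endswith("s"):
        return form[:-1]
    return form


lemmatize = make_lemmatizer({"mice": "mouse"}, strip_plural_s)
print(lemmatize("mice", "NOUN"))
print(lemmatize("cats", "NOUN"))
print(lemmatize("cats"))
```

Keeping the fallback as a plain callable means the lookup table can ship first, and a model can be plugged in later without changing callers.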

To be specific, the input and output sequences are the characters of the word form and of its lemma, so the problem is analogous to neural machine translation. Existing NMT architectures would therefore be the best place to start.
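To make the character-level framing concrete, here is one possible way to turn (form, lemma, POS) triples into source and target symbol sequences. The helper names, the `<s>`/`</s>`/`<pad>` markers, and the choice to prepend the POS tag as an extra source symbol are all assumptions made for illustration; the issue does not prescribe a particular encoding.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def encode_example(
    form: str, lemma: str, pos: Optional[str] = None
) -> Tuple[List[str], List[str]]:
    """Split a form/lemma pair into character sequences."""
    source = list(form)
    if pos is not None:
        # Supply the POS tag as a single extra symbol at the start.
        source = ["<" + pos + ">"] + source
    # Start and end markers let a decoder know where the lemma stops.
    target = ["<s>"] + list(lemma) + ["</s>"]
    return source, target


def build_vocab(sequences: Iterable[List[str]]) -> Dict[str, int]:
    """Assign an integer id to each symbol, reserving 0 for padding."""
    vocab = {"<pad>": 0}
    for seq in sequences:
        for symbol in seq:
            vocab.setdefault(symbol, len(vocab))
    return vocab


pairs = [("mice", "mouse", "NOUN"), ("was", "be", "VERB")]
examples = [encode_example(form, lemma, pos) for form, lemma, pos in pairs]
src_vocab = build_vocab(src for src, _ in examples)
ids = [src_vocab[symbol] for symbol in examples[0][0]]
print(examples[0])
print(ids)
```

The resulting id sequences are the kind of input a standard encoder-decoder model expects; the model itself is out of scope for this sketch.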

Details

Difficulty: easy / suitable for new contributors. The extension is more complicated and requires some ML experience.
Files to change: lemmatizer.py, language.py, possibly morphology.pyx
Tests: tests/lemmatizer (not created yet)
Related issues: #390

Issue Analytics

State:
Created 7 years ago
Reactions: 3
Comments: 8 (5 by maintainers)

Top GitHub Comments

4 reactions

TerminalWitchcraft commented, Feb 26, 2017

Already started to implement!
