💫 Generic lemmatization & morphology

Simple and generic lemmatization functionality can be implemented as a look-up table. The lemmatization lists here look good: http://www.lexiconista.com/datasets/lemmatization/

To make this work, we need a simple class that behaves similarly to the existing spacy.lemmatizer.Lemmatizer class. A language subclass can then create this lookup lemmatizer in its Language.Defaults.create_lemmatizer() method.
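A minimal sketch of what such a lookup class could look like is below. It is not spaCy's actual implementation: the class name `LookupLemmatizer`, the `from_lines` helper, and the call signature (loosely modelled on how a lemmatizer might be called with a string and a POS tag) are all assumptions made for illustration. It also assumes the lemmatization lists are tab-separated with the lemma first and the inflected form second, which should be checked against the downloaded files.

```python
from typing import Dict, Iterable, List, Optional


class LookupLemmatizer:
    """Table-based lemmatizer (hypothetical; not spaCy's actual class)."""

    def __init__(self, table: Dict[str, str]) -> None:
        # Maps an inflected form to its lemma.
        self.table = table

    @classmethod
    def from_lines(cls, lines: Iterable[str]) -> "LookupLemmatizer":
        """Build the table from lines assumed to look like 'lemma<TAB>form'."""
        table: Dict[str, str] = {}
        for line in lines:
            line = line.strip()
            if not line or "\t" not in line:
                continue  # skip blank or malformed lines
            lemma, form = line.split("\t", 1)
            # If a form is listed more than once, keep the first lemma seen.
            table.setdefault(form, lemma)
        return cls(table)

    def __call__(
        self,
        string: str,
        univ_pos: Optional[str] = None,
        morphology: Optional[dict] = None,
    ) -> List[str]:
        # univ_pos and morphology are accepted but ignored by a plain lookup.
        # Unknown words fall back to the surface form itself.
        return [self.table.get(string, string)]


lemmatizer = LookupLemmatizer.from_lines(["go\twent", "be\twas", "", "malformed"])
print(lemmatizer("went"))     # known form
print(lemmatizer("unknown"))  # falls back to the input string
```

Returning the surface form for unknown words keeps the lookup total, so a language subclass could use it without special-casing misses.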

This will give us a low-effort start on lemmatization in many languages, which can later be replaced by more sophisticated strategies on a case-by-case basis.

Wishlist extension

It would be very interesting to try a sequence-to-sequence model to generate the lemmas, using the existing lists as training data. The sequence-to-sequence model would then be used when the lookup fails. I think this might perform quite well, especially if a POS tag can be supplied as a feature.

To be specific, the input and output sequences are the characters of the word form and of its lemma, and the problem is analogous to neural machine translation. So, existing NMT architectures would be the best place to start.
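As a rough sketch of the data side of this idea, the snippet below turns a (form, lemma) pair into character sequences, optionally prefixing the POS tag as an extra source token, and shows the lookup-first, model-second fallback. The function names, the `<POS>` token convention, and the `generate` callable are all hypothetical; `echo_model` is only a stand-in for a trained character-level model, which is not shown here.

```python
from typing import Callable, Dict, List, Optional, Tuple


def make_example(
    form: str, lemma: str, pos: Optional[str] = None
) -> Tuple[List[str], List[str]]:
    """Turn a (form, lemma) pair into source/target character sequences."""
    source = list(form)
    if pos is not None:
        # Assumed convention: feed the POS tag as one extra leading token.
        source = [f"<{pos}>"] + source
    return source, list(lemma)


def lemmatize_with_fallback(
    form: str,
    table: Dict[str, str],
    generate: Callable[[List[str]], List[str]],
    pos: Optional[str] = None,
) -> str:
    """Use the lookup table first; call the model only when the lookup fails."""
    if form in table:
        return table[form]
    source, _ = make_example(form, "", pos)
    return "".join(generate(source))


def echo_model(source: List[str]) -> List[str]:
    """Stand-in for a trained model: copies the characters, drops tag tokens."""
    return [tok for tok in source if not tok.startswith("<")]


print(make_example("went", "go", "VERB"))
print(lemmatize_with_fallback("went", {"went": "go"}, echo_model))
print(lemmatize_with_fallback("cats", {}, echo_model, pos="NOUN"))
```

Keeping the model behind a plain callable means the lookup path does not depend on any ML library, and the generator can be swapped out independently.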

Details

  • Difficulty: easy / suitable for new contributors. The extension is more complicated and requires some ML experience.
  • Files to change: lemmatizer.py, language.py, possibly morphology.pyx
  • Tests: tests/lemmatizer (not created yet)
  • Related issues: #390

Issue Analytics

  • State: closed
  • Created 7 years ago
  • Reactions: 3
  • Comments: 8 (5 by maintainers)

Top GitHub Comments

4 reactions
TerminalWitchcraft commented, Feb 26, 2017

Already started to implement!

0 reactions
lock[bot] commented, May 8, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

Top Results From Across the Web

Stemming and lemmatization - Stanford NLP Group
Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove ...

BioLemmatizer: a lemmatization tool for morphological ... - NCBI
The tool focuses on the inflectional morphology of English and is based on the general English lemmatization tool MorphAdorner. The BioLemmatizer is further ...

A simple multilingual lemmatizer for Python - Bits of Language
The Python library Simplemma provides a simple and multilingual approach to look for base forms or lemmata, it currently supports 35 languages.

On Lemmatization and Morphological Tagging for Highly ...
The current thesis mainly focuses on the lemmatization problem which ... The Transformation process between word to lemma is shaped into a generic ...

simplemma - PyPI
Lemmatization is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified...
