Problems and errors in German lemmatizer
How to reproduce the behaviour
import spacy

nlp = spacy.load('de')

test = nlp.tokenizer('die Versicherungen')  # The insuranceS
for t in test:
    print(t, t.lemma_)
# [output] die der
# [output] Versicherungen Versicherung

test = nlp.tokenizer('Die Versicherungen')  # The insuranceS
for t in test:
    print(t, t.lemma_)
# [output] Die Die
# [output] Versicherungen Versicherung

test = nlp.tokenizer('die versicherungen')  # The insuranceS
for t in test:
    print(t, t.lemma_)
# [output] die der
# [output] versicherungen versicherungen
Your Environment
- Python version: 3.5.2
- Models: de
- Platform: Linux-4.4.0-112-generic-x86_64-with-Ubuntu-16.04-xenial
- spaCy version: 2.0.11
Hi all,
I hope the code snippet exemplifies the problem clearly enough.
Basically, I fail to see how the German lemmatization should be used.
Nouns are only lemmatized if they are capitalized, while all other words are only lemmatized if they are lower case. So lowercasing everything with lower() throws away all noun lemmas, and trusting the input to have proper capitalization loses every case where a non-noun stands at the beginning of a sentence (and is therefore not lower case).
How do people actually use this in a real use-case?
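The only stopgap I can think of is to probe a couple of casings per word and keep the first lemma that actually changes. A rough, single-word-only sketch (lemma_with_case_fallback is just a name I made up, not a spaCy API):

def lemma_with_case_fallback(nlp, word):
    # Try the word as-is, then title-cased, then lower-cased, and return the
    # first lemma that differs from its input form.
    for variant in (word, word.title(), word.lower()):
        lemma = nlp.tokenizer(variant)[0].lemma_
        if lemma.lower() != variant.lower():
            return lemma
    return word

print(lemma_with_case_fallback(nlp, 'versicherungen'))  # hopefully Versicherung
print(lemma_with_case_fallback(nlp, 'Die'))             # hopefully der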
Thanks for your help,
Andrea.
I think lookup with POS tag will solve the majority of the issues.
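Just to illustrate the idea: a POS-aware lookup could key the table on the lower-cased form plus the coarse POS tag instead of the raw string, so "die"/DET and "Versicherungen"/NOUN can map independently of capitalization. The entries below are a toy example, not real data:

POS_LOOKUP = {
    ('die', 'DET'): 'der',
    ('versicherungen', 'NOUN'): 'Versicherung',
}

def pos_aware_lemma(token):
    # Fall back to the surface form if the (form, POS) pair is unknown.
    return POS_LOOKUP.get((token.text.lower(), token.pos_), token.text)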
Btw, if you want to experiment with my lemmatizer design, here it is:
https://github.com/DuyguA/DEMorphy
You can find the list of accompanying morphological dictionaries in the repo as well.
If you need German language resources, you can always contact me and my colleagues at Parlamind. We're more than happy to help.
Making this the master issue for everything related to the German lemmatizer, so copying over the other comments and test cases. We’re currently planning out various improvements to the rule-based lemmatizer, and strategies to replace the lookup tables with rules wherever possible.
#2368
#2120
The German lemmatizer currently only uses a lookup table – that’s fine for some cases, but obviously not as good as a solution that takes part-of-speech tags into account.
You might want to check out #2079, which discusses a solution for implementing a custom lemmatizer in French – either based on spaCy’s English lemmatization rules, or by implementing a third-party library via a custom pipeline component.
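To sketch what plugging in a third-party lemmatizer could look like with the v2.x API (external_lemma below is only a stand-in for whatever resource you actually use, not a real library call):

import spacy

def external_lemma(text, pos):
    # Placeholder for a third-party resource (DEMorphy, IWNLP, ...).
    return {'versicherungen': 'Versicherung'}.get(text.lower())

def german_lemmatizer(doc):
    # Custom pipeline component: overwrite the lookup lemma where the
    # external resource knows better.
    for token in doc:
        lemma = external_lemma(token.text, token.pos_)
        if lemma is not None:
            token.lemma_ = lemma
    return doc

nlp = spacy.load('de')
nlp.add_pipe(german_lemmatizer, after='tagger')
doc = nlp('Die Versicherungen zahlen nicht.')
print([(t.text, t.lemma_) for t in doc])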
One quick note on the expected lemmatization / tokenization:
spaCy's German tokenization rules currently don't split contractions like "unterm". One reason is that spaCy will never modify the original ORTH value of the tokens, so "unterm" would have to become ["unter", "m"], where the token "m" will have the NORM "dem". Those single-letter tokens can easily lead to confusion, which is why we've opted not to produce them for now. But if your treebank or expected tokenization requires contractions to be split, you can easily add your own special case rules (a sketch follows below).

We don't have an immediate plan or timeline yet, but we'd definitely love to move from lookup lemmatization to rule-based or statistical lemmatization in the future. (Shipping the tables with spaCy really adds a lot of bloat and it comes with all kinds of other problems.)
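For example, something along these lines should work with the v2.x tokenizer (treat it as a sketch; the set of attributes you're allowed to set in a special case may differ between spaCy versions):

import spacy
from spacy.attrs import ORTH, NORM

nlp = spacy.load('de')

# Split "unterm" into "unter" + "m" and give the one-letter token the NORM
# "dem". The ORTH values have to concatenate back to the original string.
nlp.tokenizer.add_special_case('unterm', [
    {ORTH: 'unter'},
    {ORTH: 'm', NORM: 'dem'},
])

doc = nlp('Das Buch liegt unterm Tisch.')
print([(t.text, t.norm_) for t in doc])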