
A way to tell which tokens `LatinBackoffLemmatizer()` has failed to lemmatize


In `LatinBackoffLemmatizer()` and the lemmatizers in its chain, I can't find an option to return an empty value when the lemmatizer fails to assign a lemma (such as `OldEnglishDictionaryLemmatizer()`'s `best_guess=False` option); instead, the input token itself is returned.

Without such an option, it is not possible to tell successful from unsuccessful lemmatization attempts programmatically, which severely limits the range of the lemmatizer's applications.
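To see why the fallback behavior is a problem, here is a minimal toy stand-in (not CLTK itself; `LEMMAS` and `lemmatize` are hypothetical) that, like the backoff chain described above, returns the input token on a dictionary miss. A genuine identity lemma and a miss then look identical in the output:

```python
# Toy dictionary lemmatizer mimicking the fallback behavior at issue:
# on a miss, the token itself is returned.
LEMMAS = {"virumque": "vir", "cano": "cano"}

def lemmatize(tokens):
    # dict.get(t, t) falls back to the token when there is no entry
    return [(t, LEMMAS.get(t, t)) for t in tokens]

pairs = lemmatize(["arma", "virumque", "cano", "euhhhh"])
print(pairs)
# [('arma', 'arma'), ('virumque', 'vir'), ('cano', 'cano'), ('euhhhh', 'euhhhh')]
```

Here `('arma', 'arma')` (where the lemma legitimately equals the token) and `('euhhhh', 'euhhhh')` (an outright miss) are indistinguishable, so a `lemma == token` check cannot recover the failures.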

Issue Analytics

  • State: open
  • Created: 9 months ago
  • Comments: 6 (5 by maintainers)

Top GitHub Comments

2 reactions
clemsciences commented, Dec 20, 2022

Ok, I made something that might help you.

self.backoff1 is now a DefaultLemmatizer instance that returns None if no result was found.

import os
from typing import List

from cltk.lemmatize.backoff import (
    DefaultLemmatizer,
    DictLemmatizer,
    RegexpLemmatizer,
    UnigramLemmatizer,
)
from cltk.utils import CLTK_DATA_DIR
from cltk.utils.file_operations import open_pickle
from cltk.lemmatize.lat import latin_sub_patterns

models_path = os.path.normpath(
    os.path.join(CLTK_DATA_DIR, "lat/model/lat_models_cltk/lemmata/backoff")
)



class CustomLatinBackoffLemmatizer:
    """Suggested backoff chain; includes at least one of each
    type of major sequential backoff class from backoff.py.
    """

    def __init__(
        self, train: List[list] = None, seed: int = 3, verbose: bool = False
    ):
        self.models_path = models_path

        missing_models_message = "LatinBackoffLemmatizer requires the ```latin_models_cltk``` to be in cltk_data. Please load this corpus."

        try:
            self.train = open_pickle(
                os.path.join(self.models_path, "latin_pos_lemmatized_sents.pickle")
            )
            self.LATIN_OLD_MODEL = open_pickle(
                os.path.join(self.models_path, "latin_lemmata_cltk.pickle")
            )
            self.LATIN_MODEL = open_pickle(
                os.path.join(self.models_path, "latin_model.pickle")
            )
        except FileNotFoundError as err:
            raise type(err)(missing_models_message)

        self.latin_sub_patterns = latin_sub_patterns  # Move to latin_models_cltk

        self.seed = seed
        self.VERBOSE = verbose

        def _randomize_data(train: List[list], seed: int):
            import random

            random.seed(seed)
            random.shuffle(train)
            train_size = int(0.9 * len(train))
            pos_train_sents = train[:train_size]
            lem_train_sents = [[(item[0], item[1]) for item in sent] for sent in train]
            train_sents = lem_train_sents[:train_size]
            test_sents = lem_train_sents[train_size:]

            return pos_train_sents, train_sents, test_sents

        self.pos_train_sents, self.train_sents, self.test_sents = _randomize_data(
            self.train, self.seed
        )
        self._define_lemmatizer()

    def _define_lemmatizer(self):
        # Suggested backoff chain--should be tested for optimal order
        self.backoff1 = DefaultLemmatizer(verbose=self.VERBOSE)
        self.backoff2 = DictLemmatizer(
            lemmas=self.LATIN_OLD_MODEL,
            source="Morpheus Lemmas",
            backoff=self.backoff1,
            verbose=self.VERBOSE,
        )
        self.backoff3 = RegexpLemmatizer(
            self.latin_sub_patterns,
            source="CLTK Latin Regex Patterns",
            backoff=self.backoff2,
            verbose=self.VERBOSE,
        )
        self.backoff4 = UnigramLemmatizer(
            self.train_sents,
            source="CLTK Sentence Training Data",
            backoff=self.backoff3,
            verbose=self.VERBOSE,
        )
        self.backoff5 = DictLemmatizer(
            lemmas=self.LATIN_MODEL,
            source="Latin Model",
            backoff=self.backoff4,
            verbose=self.VERBOSE,
        )
        self.lemmatizer = self.backoff5

    def lemmatize(self, tokens: List[str]):
        lemmas = self.lemmatizer.lemmatize(tokens)
        return lemmas

    def evaluate(self):
        if self.VERBOSE:
            raise AssertionError(
                "evaluate() method only works when verbose: bool = False"
            )
        return self.lemmatizer.evaluate(self.test_sents)

    def __repr__(self):
        return "<CustomLatinBackoffLemmatizer>"

    def __call__(self, token: str) -> str:
        return self.lemmatize([token])[0][0]

And you get:

>>> lemmatizer = CustomLatinBackoffLemmatizer()
>>> list(lemmatizer.lemmatize('arma virumque cano euhhhh'.split()))
[('arma', 'arma'),
 ('virumque', 'vir'),
 ('cano', 'cano'),
 ('euhhhh', None)]
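With `None` marking failures, the unlemmatized tokens become trivially recoverable by post-processing the `(token, lemma)` pairs. A minimal sketch (the `pairs` value below is copied from the output above, not recomputed through CLTK):

```python
# (token, lemma) pairs as produced by the custom chain, where None marks
# a token the whole backoff chain failed to lemmatize
pairs = [("arma", "arma"), ("virumque", "vir"), ("cano", "cano"), ("euhhhh", None)]

# tokens the lemmatizer gave up on
failed = [tok for tok, lem in pairs if lem is None]

# fraction of tokens that received a lemma
coverage = 1 - len(failed) / len(pairs)

print(failed)    # ['euhhhh']
print(coverage)  # 0.75
```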
1 reaction
clemsciences commented, Dec 20, 2022

By the way, the suggested solution is not optimal at all; I could have made a child class of LatinBackoffLemmatizer that just overrides _define_lemmatizer.


from cltk.lemmatize.backoff import (
    DefaultLemmatizer,
    DictLemmatizer,
    RegexpLemmatizer,
    UnigramLemmatizer,
)

from cltk.lemmatize.lat import LatinBackoffLemmatizer


class CustomLatinBackoffLemmatizer(LatinBackoffLemmatizer):

    def _define_lemmatizer(self):
        # Suggested backoff chain--should be tested for optimal order
        self.backoff1 = DefaultLemmatizer(verbose=self.VERBOSE)
        self.backoff2 = DictLemmatizer(
            lemmas=self.LATIN_OLD_MODEL,
            source="Morpheus Lemmas",
            backoff=self.backoff1,
            verbose=self.VERBOSE,
        )
        self.backoff3 = RegexpLemmatizer(
            self.latin_sub_patterns,
            source="CLTK Latin Regex Patterns",
            backoff=self.backoff2,
            verbose=self.VERBOSE,
        )
        self.backoff4 = UnigramLemmatizer(
            self.train_sents,
            source="CLTK Sentence Training Data",
            backoff=self.backoff3,
            verbose=self.VERBOSE,
        )
        self.backoff5 = DictLemmatizer(
            lemmas=self.LATIN_MODEL,
            source="Latin Model",
            backoff=self.backoff4,
            verbose=self.VERBOSE,
        )
        self.lemmatizer = self.backoff5

    def __repr__(self):
        return "<CustomLatinBackoffLemmatizer>"

>>> lemmatizer = CustomLatinBackoffLemmatizer()
>>> list(lemmatizer.lemmatize('arma virumque cano euhhhh'.split()))
[('arma', 'arma'), ('virumque', 'vir'), ('cano', 'cano'), ('euhhhh', None)]
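If you want both behaviors available — `None` for programmatic filtering, the token itself for display — a thin wrapper in the spirit of `OldEnglishDictionaryLemmatizer`'s `best_guess` option is straightforward. This is a hedged sketch; `lemmatize_pairs` is a hypothetical helper, not part of CLTK, and operates on the `(token, lemma)` pairs the custom chain returns:

```python
def lemmatize_pairs(pairs, best_guess=True):
    """Post-process (token, lemma) pairs where None marks a failure.

    With best_guess=True, substitute the token itself for a missing lemma
    (the default LatinBackoffLemmatizer behavior); with best_guess=False,
    keep None so failures stay detectable.
    """
    if best_guess:
        return [(tok, lem if lem is not None else tok) for tok, lem in pairs]
    return list(pairs)

pairs = [("cano", "cano"), ("euhhhh", None)]
print(lemmatize_pairs(pairs, best_guess=True))   # [('cano', 'cano'), ('euhhhh', 'euhhhh')]
print(lemmatize_pairs(pairs, best_guess=False))  # [('cano', 'cano'), ('euhhhh', None)]
```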