
unk_id is missing for SentencepieceTokenizer

See original GitHub issue

I trained a SentencePiece tokenizer with the 🤗 Tokenizers library, with some added tokens:

 "added_tokens": [
    {
      "id": 0,
      "special": true,
      "content": "<unk>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false
    },
    {
      "id": 1,
      "special": true,
      "content": "<s>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false
    },
    {
      "id": 2,
      "special": true,
      "content": "</s>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false
    },
    {
      "id": 3,
      "special": true,
      "content": "<cls>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false
    },
    {
      "id": 4,
      "special": true,
      "content": "<sep>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false
    },
    {
      "id": 5,
      "special": true,
      "content": "<pad>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false
    },
    {
      "id": 6,
      "special": true,
      "content": "<mask>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false
    },
    {
      "id": 7,
      "special": true,
      "content": "<eod>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false
    },
    {
      "id": 8,
      "special": true,
      "content": "<eop>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false
    }
  ],

When this tokenizer encounters unknown words, it throws an "unk_id is missing" error:

Exception: Encountered an unknown token but unk_id is missing

How do I set unk_id for this tokenizer?

This is how I load my tokenizer:

from tokenizers import Tokenizer
from transformers import XLNetTokenizerFast

tokenizer = Tokenizer.from_file('./unigram.json')
tokenizer = XLNetTokenizerFast(tokenizer_object=tokenizer, unk_token="<unk>")

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
darwinharianto commented, Jul 20, 2021

Thanks! Manually changing unk_id to 0 made it work.

1 reaction
SaulLu commented, Jul 19, 2021

Thanks for the document. I think the error comes from the fact that the value under the model key's unk_id subkey in your unigram.json file is null instead of 0 ("<unk>" is the first token of your vocabulary).
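For reference, in a file produced by 🤗 tokenizers the unk_id lives in the "model" section; in the failing file it is null, and after the fix it should look roughly like the minimal sketch below (placeholder vocab entries, other keys omitted):

{
  "model": {
    "type": "Unigram",
    "unk_id": 0,
    "vocab": [
      ["<unk>", 0.0],
      ["<s>", 0.0],
      ["</s>", 0.0]
    ]
  }
}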

If you used the training script of SentencePieceUnigramTokenizer provided in the 🤗 tokenizers library, I've opened a PR here to fix this missing information for future trainings, since the unk_token needs to be passed to the Trainer and there is currently no way to do that. 🙂
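For completeness, with a tokenizers release that includes that PR, a retraining call could look like the sketch below; the corpus path and vocab size are placeholders, and the unk_token argument is the addition the PR describes:

from tokenizers import SentencePieceUnigramTokenizer

# Retrain the unigram tokenizer, explicitly declaring the unknown token so
# that unk_id is recorded in the serialized JSON.
tokenizer = SentencePieceUnigramTokenizer()
tokenizer.train(
    files=["corpus.txt"],  # placeholder training corpus
    vocab_size=8000,       # placeholder vocabulary size
    special_tokens=["<unk>", "<s>", "</s>", "<cls>", "<sep>", "<pad>", "<mask>", "<eod>", "<eop>"],
    unk_token="<unk>",
)
tokenizer.save("unigram.json")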

However, I guess you don't want to re-train your tokenizer. In that case, the simplest fix is to change by hand the value associated with the unk_id key in the unigram.json file so that it matches the id of the unknown token (in your case, 0).
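If you would rather not edit the file by hand, the same fix can be scripted; this is a minimal sketch, assuming the file follows the usual tokenizers layout and that "<unk>" has id 0 as in the vocabulary above:

import json

# Load the serialized tokenizer, fill in the missing unk_id, and write it back.
with open("unigram.json", "r", encoding="utf-8") as f:
    data = json.load(f)

data["model"]["unk_id"] = 0  # "<unk>" is id 0 in this vocabulary

with open("unigram.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)

Reloading the patched file with Tokenizer.from_file should then no longer raise the exception on unknown tokens.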

I would be happy to know if this indeed solves the error you had ☺️


Top Results From Across the Web

SentencePiece Tokenizer Demystified | by Jonathan Kernes
It's actually a method for selecting tokens from a precompiled list, optimizing the tokenization process based on a supplied corpus.

Sentencepiece Tokenizer With Offsets For T5, ALBERT, XLM ...
In this video I show you how to use Google's implementation of Sentencepiece tokenizer for question and answering systems.

SentencePieceTokenizer - Keras
A SentencePiece tokenizer layer. This layer provides an implementation of SentencePiece tokenization as described in the SentencePiece paper and the ...

Summary of the tokenizers - Hugging Face
More specifically, we will look at the three main types of tokenizers used in Transformers: Byte-Pair Encoding (BPE), WordPiece, and SentencePiece, and show ...

torchtext.transforms - PyTorch
Transform for Sentence Piece tokenizer from pre-trained sentencepiece model ... from torchtext.transforms import SentencePieceTokenizer >>> transform ...
