
unk_id is missing for SentencepieceTokenizer

See original GitHub issue

I trained a SentencePiece tokenizer with the 🤗 Tokenizers library, with some added tokens:

 "added_tokens": [
    {
      "id": 0,
      "special": true,
      "content": "<unk>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false
    },
    {
      "id": 1,
      "special": true,
      "content": "<s>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false
    },
    {
      "id": 2,
      "special": true,
      "content": "</s>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false
    },
    {
      "id": 3,
      "special": true,
      "content": "<cls>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false
    },
    {
      "id": 4,
      "special": true,
      "content": "<sep>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false
    },
    {
      "id": 5,
      "special": true,
      "content": "<pad>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false
    },
    {
      "id": 6,
      "special": true,
      "content": "<mask>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false
    },
    {
      "id": 7,
      "special": true,
      "content": "<eod>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false
    },
    {
      "id": 8,
      "special": true,
      "content": "<eop>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false
    }
  ],

When this tokenizer encounters unknown words, it throws an "unk_id is missing" error:

Exception: Encountered an unknown token but unk_id is missing

How do I set unk_id for this tokenizer?

This is how I load my tokenizer:

from tokenizers import Tokenizer
from transformers import XLNetTokenizerFast

tokenizer = Tokenizer.from_file('./unigram.json')
tokenizer = XLNetTokenizerFast(tokenizer_object=tokenizer, unk_token="<unk>")

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
darwinharianto commented, Jul 20, 2021

Thanks! Manually changing unk_id to 0 made it work.

1 reaction
SaulLu commented, Jul 19, 2021

Thanks for the document. I think the error comes from the fact that the value under the model key's unk_id subkey in your unigram.json file is null instead of 0 ("<unk>" is the first token of your vocabulary).
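For reference, in a file produced by 🤗 tokenizers the unk_id lives in the "model" section; in the failing file it is null, and after the fix it should look roughly like the minimal sketch below (placeholder vocab entries, other keys omitted):

{
  "model": {
    "type": "Unigram",
    "unk_id": 0,
    "vocab": [
      ["<unk>", 0.0],
      ["<s>", 0.0],
      ["</s>", 0.0]
    ]
  }
}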

If you used the training script of SentencePieceUnigramTokenizer provided in the 🤗 tokenizers library, I've opened a PR here to fix this missing information for future trainings, since the unk_token needs to be passed to the Trainer and there is currently no way to do that. 🙂
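For completeness, with a tokenizers release that includes that PR, a retraining call could look like the sketch below; the corpus path and vocab size are placeholders, and the unk_token argument is the addition the PR describes:

from tokenizers import SentencePieceUnigramTokenizer

# Retrain the unigram tokenizer, explicitly declaring the unknown token so
# that unk_id is recorded in the serialized JSON.
tokenizer = SentencePieceUnigramTokenizer()
tokenizer.train(
    files=["corpus.txt"],  # placeholder training corpus
    vocab_size=8000,       # placeholder vocabulary size
    special_tokens=["<unk>", "<s>", "</s>", "<cls>", "<sep>", "<pad>", "<mask>", "<eod>", "<eop>"],
    unk_token="<unk>",
)
tokenizer.save("unigram.json")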

However, I guess you don't want to re-train your tokenizer. In that case, the simplest fix is to change by hand the value associated with the unk_id key in the unigram.json file so that it matches the id of the unknown token (in your case, 0).
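If you would rather not edit the file by hand, the same fix can be scripted; this is a minimal sketch, assuming the file follows the usual tokenizers layout and that "<unk>" has id 0 as in the vocabulary above:

import json

# Load the serialized tokenizer, fill in the missing unk_id, and write it back.
with open("unigram.json", "r", encoding="utf-8") as f:
    data = json.load(f)

data["model"]["unk_id"] = 0  # "<unk>" is id 0 in this vocabulary

with open("unigram.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)

Reloading the patched file with Tokenizer.from_file should then no longer raise the exception on unknown tokens.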

I would be happy to know if this indeed solves the error you had ☺️


Top Results From Across the Web

SentencePiece Tokenizer Demystified | by Jonathan Kernes
It's actually a method for selecting tokens from a precompiled list, optimizing the tokenization process based on a supplied corpus.

Sentencepiece Tokenizer With Offsets For T5, ALBERT, XLM ...
In this video I show you how to use Google's implementation of Sentencepiece tokenizer for question and answering systems.

SentencePieceTokenizer - Keras
A SentencePiece tokenizer layer. This layer provides an implementation of SentencePiece tokenization as described in the SentencePiece paper and the ...

Summary of the tokenizers - Hugging Face
More specifically, we will look at the three main types of tokenizers used in Transformers: Byte-Pair Encoding (BPE), WordPiece, and SentencePiece, and show ...

torchtext.transforms - PyTorch
Transform for Sentence Piece tokenizer from pre-trained sentencepiece model ... from torchtext.transforms import SentencePieceTokenizer >>> transform ...
