unk_id is missing for SentencepieceTokenizer
Trained a SentencePiece tokenizer with the Tokenizers library, with some added tokens:
"added_tokens": [
{
"id": 0,
"special": true,
"content": "<unk>",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false
},
{
"id": 1,
"special": true,
"content": "<s>",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false
},
{
"id": 2,
"special": true,
"content": "</s>",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false
},
{
"id": 3,
"special": true,
"content": "<cls>",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false
},
{
"id": 4,
"special": true,
"content": "<sep>",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false
},
{
"id": 5,
"special": true,
"content": "<pad>",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false
},
{
"id": 6,
"special": true,
"content": "<mask>",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false
},
{
"id": 7,
"special": true,
"content": "<eod>",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false
},
{
"id": 8,
"special": true,
"content": "<eop>",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false
}
],
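For context, an added_tokens block like the one above could have been produced by a training run along these lines. This is a minimal, hypothetical reconstruction, not the script from the issue; the corpus file and vocab_size are placeholders.

from tokenizers import SentencePieceUnigramTokenizer

# Hypothetical reconstruction of the training step; the special tokens mirror
# the added_tokens block above, while the corpus path and vocab_size are made up.
special_tokens = ["<unk>", "<s>", "</s>", "<cls>", "<sep>", "<pad>", "<mask>", "<eod>", "<eop>"]

tokenizer = SentencePieceUnigramTokenizer()
tokenizer.train(
    files=["corpus.txt"],   # placeholder training corpus
    vocab_size=32000,       # placeholder vocabulary size
    special_tokens=special_tokens,
)
tokenizer.save("./unigram.json")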
When this tokenizer encounters unknown words, it throws an "unk_id is missing" error:

Exception: Encountered an unknown token but unk_id is missing

How do I set unk_id for this tokenizer?
This is how I load my tokenizer:

from tokenizers import Tokenizer
from transformers import XLNetTokenizerFast

tokenizer = Tokenizer.from_file('./unigram.json')
tokenizer = XLNetTokenizerFast(tokenizer_object=tokenizer, unk_token="<unk>")
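For reference, the failure can likely be reproduced on the underlying tokenizer alone by encoding text that contains pieces outside the trained vocabulary; this is a minimal sketch, and the input string is only illustrative.

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("./unigram.json")

# If the serialized Unigram model has no unk_id, there is no fallback id for
# unseen pieces, so encoding out-of-vocabulary text raises:
#   Exception: Encountered an unknown token but unk_id is missing
tokenizer.encode("🦖 text with characters the model has never seen")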
Issue Analytics
- Created 2 years ago
- Comments: 5 (3 by maintainers)
Top Results From Across the Web

SentencePiece Tokenizer Demystified | by Jonathan Kernes
It's actually a method for selecting tokens from a precompiled list, optimizing the tokenization process based on a supplied corpus.

Sentencepiece Tokenizer With Offsets For T5, ALBERT, XLM ...
In this video I show you how to use Google's implementation of the Sentencepiece tokenizer for question-and-answering systems.

SentencePieceTokenizer - Keras
A SentencePiece tokenizer layer. This layer provides an implementation of SentencePiece tokenization as described in the SentencePiece paper.

Summary of the tokenizers - Hugging Face
More specifically, we will look at the three main types of tokenizers used in Transformers: Byte-Pair Encoding (BPE), WordPiece, and SentencePiece ...

torchtext.transforms - PyTorch
Transform for SentencePiece tokenizer from a pre-trained sentencepiece model ...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Thanks! Manually changing unk_id to 0 makes it work.
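Concretely, that by-hand change lives in the model section of unigram.json; the relevant fragment looks roughly like this (the rest of the file is elided, and the vocab entries are only illustrative):

"model": {
  "type": "Unigram",
  "unk_id": 0,
  "vocab": [["<unk>", 0.0], ["<s>", 0.0], ["</s>", 0.0], ...]
}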
Thanks for the document. I think the error comes from the fact that the value associated with the model key and the unk_id subkey in your unigram.json file is null instead of 0 ("<unk>" is the first token of your vocabulary).

If you have used the training script of SentencePieceUnigramTokenizer provided in the 🤗 tokenizers library, I've opened a PR here to solve this missing information for future trainings, as the unk_token needs to be passed to the Trainer and currently you don't have the opportunity to do so. 🙂

However, I guess you don't want to re-train your tokenizer. In this case, the simplest fix is to change by hand the value in the unigram.json file associated with the unk_id key so that it matches the id of the unknown token (in your case 0).

I would be happy to know if this indeed solves the error you had ☺️
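If editing the file by hand feels error-prone, the same one-line fix can be scripted; the sketch below assumes, as in this issue, that "<unk>" has id 0 and that the file is the unigram.json loaded above.

import json

# Load the serialized tokenizer, point the Unigram model's unk_id at the id of
# "<unk>" (0 in this vocabulary), and write the file back.
with open("./unigram.json", "r", encoding="utf-8") as f:
    data = json.load(f)

data["model"]["unk_id"] = 0  # was null, which is what triggered the exception

with open("./unigram.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)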