
Error while saving a variation of roberta-base fast tokenizer vocabulary


Information

Unable to save the 'ufal/robeczech-base' fast tokenizer, which is a variation of RoBERTa. I tried the same minimal example (see below) with the slow (non-fast) tokenizer and it worked fine.

Error message with RUST_BACKTRACE=1:

thread '<unnamed>' panicked at 'no entry found for key', /__w/tokenizers/tokenizers/tokenizers/src/models/mod.rs:36:66
stack backtrace:
   0: rust_begin_unwind
             at /rustc/9bc8c42bb2f19e745a63f3445f1ac248fb015e53/library/std/src/panicking.rs:493:5
   1: core::panicking::panic_fmt
             at /rustc/9bc8c42bb2f19e745a63f3445f1ac248fb015e53/library/core/src/panicking.rs:92:14
   2: core::option::expect_failed
             at /rustc/9bc8c42bb2f19e745a63f3445f1ac248fb015e53/library/core/src/option.rs:1321:5
   3: serde::ser::Serializer::collect_map
   4: <tokenizers::models::bpe::model::BPE as tokenizers::tokenizer::Model>::save
   5: <tokenizers::models::ModelWrapper as tokenizers::tokenizer::Model>::save
   6: tokenizers::models::PyModel::save
   7: tokenizers::models::__init2250971146856332535::__init2250971146856332535::__wrap::{{closure}}
   8: tokenizers::models::__init2250971146856332535::__init2250971146856332535::__wrap
   9: _PyMethodDef_RawFastCallKeywords
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Objects/call.c:694:23
  10: _PyCFunction_FastCallKeywords
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Objects/call.c:734:14
  11: call_function
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/ceval.c:4568:9
  12: _PyEval_EvalFrameDefault
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/ceval.c:3139:19
  13: _PyEval_EvalCodeWithName
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/ceval.c:3930:14
  14: _PyFunction_FastCallKeywords
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Objects/call.c:433:12
  15: call_function
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/ceval.c:4616:17
  16: _PyEval_EvalFrameDefault
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/ceval.c:3139:19
  17: _PyEval_EvalCodeWithName
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/ceval.c:3930:14
  18: _PyFunction_FastCallKeywords
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Objects/call.c:433:12
  19: call_function
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/ceval.c:4616:17
  20: _PyEval_EvalFrameDefault
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/ceval.c:3139:19
  21: _PyEval_EvalCodeWithName
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/ceval.c:3930:14
  22: _PyFunction_FastCallKeywords
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Objects/call.c:433:12
  23: call_function
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/ceval.c:4616:17
  24: _PyEval_EvalFrameDefault
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/ceval.c:3110:23
  25: _PyEval_EvalCodeWithName
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/ceval.c:3930:14
  26: PyEval_EvalCodeEx
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/ceval.c:3959:12
  27: PyEval_EvalCode
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/ceval.c:524:12
  28: run_mod
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/pythonrun.c:1035:9
  29: PyRun_InteractiveOneObjectEx
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/pythonrun.c:256:9
  30: PyRun_InteractiveLoopFlags
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/pythonrun.c:120:15
  31: PyRun_AnyFileExFlags
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/pythonrun.c:78:19
  32: pymain_run_file
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Modules/main.c:427:11
  33: pymain_run_filename
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Modules/main.c:1606:22
  34: pymain_run_python
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Modules/main.c:2867:9
  35: pymain_main
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Modules/main.c:3028:5
  36: _Py_UnixMain
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Modules/main.c:3063:12
  37: __libc_start_main
  38: <unknown>
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ryparmar/venv/NER/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 2034, in save_pretrained
    filename_prefix=filename_prefix,
  File "/home/ryparmar/venv/NER/lib/python3.7/site-packages/transformers/tokenization_utils_fast.py", line 567, in _save_pretrained
    vocab_files = self.save_vocabulary(save_directory, filename_prefix=filename_prefix)
  File "/home/ryparmar/venv/NER/lib/python3.7/site-packages/transformers/models/gpt2/tokenization_gpt2_fast.py", line 177, in save_vocabulary
    files = self._tokenizer.model.save(save_directory, name=filename_prefix)
pyo3_runtime.PanicException: no entry found for key
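
The panic originates in the BPE model's save path, where the serializer looks tokens up in the vocabulary map and aborts when an expected entry is missing. The following is a rough, illustrative Python analogy of that failure mode (not the actual tokenizers code; `save_merges` and the sample vocab/merges are invented for illustration):

```python
# Illustrative analogy of the "no entry found for key" panic: the serializer
# walks the merge rules and expects every merged token to exist in the vocab.
def save_merges(vocab, merges):
    lines = []
    for a, b in merges:
        if a + b not in vocab:
            # In the real library this surfaces as a Rust panic, which PyO3
            # re-raises in Python as pyo3_runtime.PanicException.
            raise KeyError(f"no entry found for key: {a + b}")
        lines.append(f"{a} {b}")
    return lines

complete_vocab = {"hello": 0, "world": 1}
merges = [("hel", "lo"), ("wor", "ld")]
print(save_merges(complete_vocab, merges))  # ['hel lo', 'wor ld']

broken_vocab = {"hello": 0}  # "world" entry missing
try:
    save_merges(broken_vocab, merges)
except KeyError as e:
    print(e)
```

This suggests the checkpoint's serialized vocabulary and merge rules are inconsistent with each other, which would explain why only the fast (Rust-backed) save path fails.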

Environment info

  • transformers version: 4.10.0
  • Platform: Linux-3.10.0-957.10.1.el7.x86_64-x86_64-with-centos-7.8.2003-Core
  • Python version: 3.7.4
  • PyTorch version (GPU?): 1.9.0+cu102 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: False
  • Using distributed or parallel set-up in script?: False

Who can help

@patrickvonplaten, @LysandreJik.

To reproduce

  1. Import the model and tokenizer:

     from transformers import AutoTokenizer, AutoModelForMaskedLM
     tokenizer = AutoTokenizer.from_pretrained("ufal/robeczech-base")
     model = AutoModelForMaskedLM.from_pretrained("ufal/robeczech-base")

  2. Save the tokenizer:

     tokenizer.save_pretrained('./')

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

2 reactions
patrickvonplaten commented, Dec 7, 2021

Hey guys,

At the moment, it seems we will have to fall back to the slow tokenizer for this one:

  1. Import the model and tokenizer:

     from transformers import AutoTokenizer, AutoModelForMaskedLM
     tokenizer = AutoTokenizer.from_pretrained("ufal/robeczech-base", use_fast=False)
     model = AutoModelForMaskedLM.from_pretrained("ufal/robeczech-base")

  2. Save the tokenizer:

     tokenizer.save_pretrained('./')

This works.

1 reaction
shudipta commented, Dec 4, 2021

I found that the problem comes from using the fast tokenizer, so I turned it off with the flag --use_fast_tokenizer=False, and it works now, though that is not the solution I want.
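
The flag mentioned here comes from the example training scripts; a minimal sketch of how such a boolean switch can be wired up with argparse (the converter and defaults here are assumptions, not the scripts' actual implementation):

```python
import argparse

def str2bool(s):
    # Shell flags arrive as strings, so "False" must be converted explicitly;
    # plain type=bool would treat any non-empty string as True.
    return s.lower() in {"1", "true", "yes"}

parser = argparse.ArgumentParser()
parser.add_argument(
    "--use_fast_tokenizer",
    type=str2bool,
    default=True,
    help="Load the Rust-backed fast tokenizer; False falls back to the slow Python one.",
)

args = parser.parse_args(["--use_fast_tokenizer=False"])
print(args.use_fast_tokenizer)  # False
```

The parsed value would then be forwarded as `use_fast=args.use_fast_tokenizer` to `AutoTokenizer.from_pretrained`, matching the workaround in the comment above.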

