
Error while saving a variation of roberta-base fast tokenizer vocabulary


Information

Unable to save the 'ufal/robeczech-base' fast tokenizer, which is a variation of RoBERTa. I tried the same minimal example (see below) with the slow (non-fast) tokenizer and it worked fine.

Error message with RUST_BACKTRACE=1:

thread '<unnamed>' panicked at 'no entry found for key', /__w/tokenizers/tokenizers/tokenizers/src/models/mod.rs:36:66
stack backtrace:
   0: rust_begin_unwind
             at /rustc/9bc8c42bb2f19e745a63f3445f1ac248fb015e53/library/std/src/panicking.rs:493:5
   1: core::panicking::panic_fmt
             at /rustc/9bc8c42bb2f19e745a63f3445f1ac248fb015e53/library/core/src/panicking.rs:92:14
   2: core::option::expect_failed
             at /rustc/9bc8c42bb2f19e745a63f3445f1ac248fb015e53/library/core/src/option.rs:1321:5
   3: serde::ser::Serializer::collect_map
   4: <tokenizers::models::bpe::model::BPE as tokenizers::tokenizer::Model>::save
   5: <tokenizers::models::ModelWrapper as tokenizers::tokenizer::Model>::save
   6: tokenizers::models::PyModel::save
   7: tokenizers::models::__init2250971146856332535::__init2250971146856332535::__wrap::{{closure}}
   8: tokenizers::models::__init2250971146856332535::__init2250971146856332535::__wrap
   9: _PyMethodDef_RawFastCallKeywords
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Objects/call.c:694:23
  10: _PyCFunction_FastCallKeywords
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Objects/call.c:734:14
  11: call_function
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/ceval.c:4568:9
  12: _PyEval_EvalFrameDefault
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/ceval.c:3139:19
  13: _PyEval_EvalCodeWithName
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/ceval.c:3930:14
  14: _PyFunction_FastCallKeywords
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Objects/call.c:433:12
  15: call_function
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/ceval.c:4616:17
  16: _PyEval_EvalFrameDefault
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/ceval.c:3139:19
  17: _PyEval_EvalCodeWithName
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/ceval.c:3930:14
  18: _PyFunction_FastCallKeywords
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Objects/call.c:433:12
  19: call_function
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/ceval.c:4616:17
  20: _PyEval_EvalFrameDefault
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/ceval.c:3139:19
  21: _PyEval_EvalCodeWithName
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/ceval.c:3930:14
  22: _PyFunction_FastCallKeywords
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Objects/call.c:433:12
  23: call_function
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/ceval.c:4616:17
  24: _PyEval_EvalFrameDefault
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/ceval.c:3110:23
  25: _PyEval_EvalCodeWithName
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/ceval.c:3930:14
  26: PyEval_EvalCodeEx
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/ceval.c:3959:12
  27: PyEval_EvalCode
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/ceval.c:524:12
  28: run_mod
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/pythonrun.c:1035:9
  29: PyRun_InteractiveOneObjectEx
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/pythonrun.c:256:9
  30: PyRun_InteractiveLoopFlags
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/pythonrun.c:120:15
  31: PyRun_AnyFileExFlags
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/pythonrun.c:78:19
  32: pymain_run_file
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Modules/main.c:427:11
  33: pymain_run_filename
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Modules/main.c:1606:22
  34: pymain_run_python
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Modules/main.c:2867:9
  35: pymain_main
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Modules/main.c:3028:5
  36: _Py_UnixMain
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Modules/main.c:3063:12
  37: __libc_start_main
  38: <unknown>
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ryparmar/venv/NER/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 2034, in save_pretrained
    filename_prefix=filename_prefix,
  File "/home/ryparmar/venv/NER/lib/python3.7/site-packages/transformers/tokenization_utils_fast.py", line 567, in _save_pretrained
    vocab_files = self.save_vocabulary(save_directory, filename_prefix=filename_prefix)
  File "/home/ryparmar/venv/NER/lib/python3.7/site-packages/transformers/models/gpt2/tokenization_gpt2_fast.py", line 177, in save_vocabulary
    files = self._tokenizer.model.save(save_directory, name=filename_prefix)
pyo3_runtime.PanicException: no entry found for key
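
The panic originates in the BPE model's save path, where the serializer looks tokens up in the vocabulary map and aborts when an expected entry is missing. The following is a rough, illustrative Python analogy of that failure mode (not the actual tokenizers code; `save_merges` and the sample vocab/merges are invented for illustration):

```python
# Illustrative analogy of the "no entry found for key" panic: the serializer
# walks the merge rules and expects every merged token to exist in the vocab.
def save_merges(vocab, merges):
    lines = []
    for a, b in merges:
        if a + b not in vocab:
            # In the real library this surfaces as a Rust panic, which PyO3
            # re-raises in Python as pyo3_runtime.PanicException.
            raise KeyError(f"no entry found for key: {a + b}")
        lines.append(f"{a} {b}")
    return lines

complete_vocab = {"hello": 0, "world": 1}
merges = [("hel", "lo"), ("wor", "ld")]
print(save_merges(complete_vocab, merges))  # ['hel lo', 'wor ld']

broken_vocab = {"hello": 0}  # "world" entry missing
try:
    save_merges(broken_vocab, merges)
except KeyError as e:
    print(e)
```

This suggests the checkpoint's serialized vocabulary and merge rules are inconsistent with each other, which would explain why only the fast (Rust-backed) save path fails.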

Environment info

  • transformers version: 4.10.0
  • Platform: Linux-3.10.0-957.10.1.el7.x86_64-x86_64-with-centos-7.8.2003-Core
  • Python version: 3.7.4
  • PyTorch version (GPU?): 1.9.0+cu102 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: False
  • Using distributed or parallel set-up in script?: False

Who can help

@patrickvonplaten, @LysandreJik.

To reproduce

  1. Import the model and tokenizer:

     from transformers import AutoTokenizer, AutoModelForMaskedLM
     tokenizer = AutoTokenizer.from_pretrained("ufal/robeczech-base")
     model = AutoModelForMaskedLM.from_pretrained("ufal/robeczech-base")

  2. Save the tokenizer:

     tokenizer.save_pretrained('./')

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

2 reactions
patrickvonplaten commented, Dec 7, 2021

Hey guys,

At the moment, it seems we will have to fall back to the slow tokenizer for this one:

  1. Import the model and tokenizer:

     from transformers import AutoTokenizer, AutoModelForMaskedLM
     tokenizer = AutoTokenizer.from_pretrained("ufal/robeczech-base", use_fast=False)
     model = AutoModelForMaskedLM.from_pretrained("ufal/robeczech-base")

  2. Save the tokenizer:

     tokenizer.save_pretrained('./')

This works.

1 reaction
shudipta commented, Dec 4, 2021

I found that the problem comes from using the fast tokenizer, so I turned it off with the flag --use_fast_tokenizer=False, and it works now, though that is not the solution I want.
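
The flag mentioned here comes from the example training scripts; a minimal sketch of how such a boolean switch can be wired up with argparse (the converter and defaults here are assumptions, not the scripts' actual implementation):

```python
import argparse

def str2bool(s):
    # Shell flags arrive as strings, so "False" must be converted explicitly;
    # plain type=bool would treat any non-empty string as True.
    return s.lower() in {"1", "true", "yes"}

parser = argparse.ArgumentParser()
parser.add_argument(
    "--use_fast_tokenizer",
    type=str2bool,
    default=True,
    help="Load the Rust-backed fast tokenizer; False falls back to the slow Python one.",
)

args = parser.parse_args(["--use_fast_tokenizer=False"])
print(args.use_fast_tokenizer)  # False
```

The parsed value would then be forwarded as `use_fast=args.use_fast_tokenizer` to `AutoTokenizer.from_pretrained`, matching the workaround in the comment above.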

