Error while saving the vocabulary of a roberta-base fast tokenizer variant
Information
Unable to save the ‘ufal/robeczech-base’ fast tokenizer, which is a variant of roberta. I tried the same minimal example (see below) with the non-fast (slow) tokenizer and it worked fine.
Error message with RUST_BACKTRACE=1:
thread '<unnamed>' panicked at 'no entry found for key', /__w/tokenizers/tokenizers/tokenizers/src/models/mod.rs:36:66
stack backtrace:
0: rust_begin_unwind
at /rustc/9bc8c42bb2f19e745a63f3445f1ac248fb015e53/library/std/src/panicking.rs:493:5
1: core::panicking::panic_fmt
at /rustc/9bc8c42bb2f19e745a63f3445f1ac248fb015e53/library/core/src/panicking.rs:92:14
2: core::option::expect_failed
at /rustc/9bc8c42bb2f19e745a63f3445f1ac248fb015e53/library/core/src/option.rs:1321:5
3: serde::ser::Serializer::collect_map
4: <tokenizers::models::bpe::model::BPE as tokenizers::tokenizer::Model>::save
5: <tokenizers::models::ModelWrapper as tokenizers::tokenizer::Model>::save
6: tokenizers::models::PyModel::save
7: tokenizers::models::__init2250971146856332535::__init2250971146856332535::__wrap::{{closure}}
8: tokenizers::models::__init2250971146856332535::__init2250971146856332535::__wrap
9: _PyMethodDef_RawFastCallKeywords
at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Objects/call.c:694:23
10: _PyCFunction_FastCallKeywords
at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Objects/call.c:734:14
11: call_function
at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/ceval.c:4568:9
12: _PyEval_EvalFrameDefault
at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/ceval.c:3139:19
13: _PyEval_EvalCodeWithName
at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/ceval.c:3930:14
14: _PyFunction_FastCallKeywords
at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Objects/call.c:433:12
15: call_function
at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/ceval.c:4616:17
16: _PyEval_EvalFrameDefault
at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/ceval.c:3139:19
17: _PyEval_EvalCodeWithName
at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/ceval.c:3930:14
18: _PyFunction_FastCallKeywords
at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Objects/call.c:433:12
19: call_function
at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/ceval.c:4616:17
20: _PyEval_EvalFrameDefault
at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/ceval.c:3139:19
21: _PyEval_EvalCodeWithName
at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/ceval.c:3930:14
22: _PyFunction_FastCallKeywords
at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Objects/call.c:433:12
23: call_function
at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/ceval.c:4616:17
24: _PyEval_EvalFrameDefault
at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/ceval.c:3110:23
25: _PyEval_EvalCodeWithName
at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/ceval.c:3930:14
26: PyEval_EvalCodeEx
at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/ceval.c:3959:12
27: PyEval_EvalCode
at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/ceval.c:524:12
28: run_mod
at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/pythonrun.c:1035:9
29: PyRun_InteractiveOneObjectEx
at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/pythonrun.c:256:9
30: PyRun_InteractiveLoopFlags
at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/pythonrun.c:120:15
31: PyRun_AnyFileExFlags
at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/pythonrun.c:78:19
32: pymain_run_file
at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Modules/main.c:427:11
33: pymain_run_filename
at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Modules/main.c:1606:22
34: pymain_run_python
at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Modules/main.c:2867:9
35: pymain_main
at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Modules/main.c:3028:5
36: _Py_UnixMain
at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Modules/main.c:3063:12
37: __libc_start_main
38: <unknown>
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/ryparmar/venv/NER/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 2034, in save_pretrained
filename_prefix=filename_prefix,
File "/home/ryparmar/venv/NER/lib/python3.7/site-packages/transformers/tokenization_utils_fast.py", line 567, in _save_pretrained
vocab_files = self.save_vocabulary(save_directory, filename_prefix=filename_prefix)
File "/home/ryparmar/venv/NER/lib/python3.7/site-packages/transformers/models/gpt2/tokenization_gpt2_fast.py", line 177, in save_vocabulary
files = self._tokenizer.model.save(save_directory, name=filename_prefix)
pyo3_runtime.PanicException: no entry found for key
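The “no entry found for key” panic is raised while the Rust BPE model serializes its vocabulary (`collect_map` in the backtrace). One plausible, unconfirmed cause is a vocabulary in which several tokens share the same id, so the reverse id→token map silently drops entries and serialization later looks up a key that is no longer there. A plain-Python sketch of that failure mode (names and logic are illustrative, not the actual tokenizers code):

```python
# Hypothetical token -> id map in which several tokens share one id.
vocab = {"[UNK]": 3, "rare_a": 3, "rare_b": 3, "the": 5}

# Building the reverse id -> token map silently drops duplicates:
# later tokens with the same id overwrite earlier ones.
reverse = {i: tok for tok, i in vocab.items()}

def serialize_by_id(vocab, reverse):
    """Emit tokens in id order, failing the way the Rust code panics
    when the reverse map disagrees with the forward map."""
    out = []
    for tok, i in sorted(vocab.items(), key=lambda kv: kv[1]):
        if reverse.get(i) != tok:
            raise KeyError("no entry found for key")
        out.append(tok)
    return out
```

Here `len(reverse)` is 2 although `vocab` has 4 entries, and `serialize_by_id(vocab, reverse)` raises `KeyError` on the first duplicated id.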
Environment info
- transformers version: 4.10.0
- Platform: Linux-3.10.0-957.10.1.el7.x86_64-x86_64-with-centos-7.8.2003-Core
- Python version: 3.7.4
- PyTorch version (GPU?): 1.9.0+cu102 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: False
- Using distributed or parallel set-up in script?: False
Who can help
@patrickvonplaten, @LysandreJik.
To reproduce
- Import model and tokenizer:
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("ufal/robeczech-base")
model = AutoModelForMaskedLM.from_pretrained("ufal/robeczech-base")
- Save the tokenizer:
tokenizer.save_pretrained('./')
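As a diagnostic before saving, one can check the loaded vocabulary for ids shared by multiple tokens. This is a hypothetical helper, not part of transformers, assuming `tokenizer.get_vocab()` returns a plain token→id dict:

```python
def find_duplicate_ids(vocab):
    """Return {id: [tokens]} for every id shared by more than one token."""
    by_id = {}
    for token, idx in vocab.items():
        by_id.setdefault(idx, []).append(token)
    return {idx: toks for idx, toks in by_id.items() if len(toks) > 1}

# Usage (hypothetical):
#   dupes = find_duplicate_ids(tokenizer.get_vocab())
#   if dupes: print("ids with multiple tokens:", dupes)
```

An empty result means every token has a unique id; a non-empty result would point at entries a reverse id→token map cannot represent.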
Issue Analytics
- Created: 2 years ago
- Comments: 6 (2 by maintainers)
Hey guys,
At the moment, it seems like we will have to fall back to the slow tokenizer for this one: loading with use_fast=False works.
I found that the problem comes from using the fast tokenizer, so I turned it off with the flag --use_fast_tokenizer=False, and it works. Though it is not the solution I want.