`fill-mask` pipeline cannot load tokenizer's `config.json` (fixed in 4.8.0)
Environment info
- `transformers` version: 4.7.0
- Platform: Linux-5.4.0-74-generic-x86_64-with-glibc2.31
- Python version: 3.9.5
- PyTorch version (GPU?): 1.9.0+cu111 (True)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: No
Who can help
Information
Model I am using: RoBERTa
The problem arises when using:
- my own modified scripts: see details below

The task I am working on is:
- my own task or dataset: see details below
To reproduce
Following the official notebook to train RoBERTa from scratch (tokenizer and model alike). The only addition is re-saving the RoBERTa tokenizer after loading it from the trained BPE files (second snippet below).
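For context, the BPE tokenizer was trained beforehand with the `tokenizers` library, as per the docs. A minimal sketch of that step; the corpus file, vocabulary size, and output path are hypothetical:

```python
from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer on a plain-text corpus (hypothetical paths and sizes)
bpe_tokenizer = ByteLevelBPETokenizer()
bpe_tokenizer.train(
    files=["/path/to/corpus.txt"],
    vocab_size=52_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Writes vocab.json and merges.txt into the target directory
bpe_tokenizer.save_model("/path/to/BPE/tokenizer")
```

The resulting vocab and merges files are then loaded into `RobertaTokenizerFast` and the tokenizer is re-saved: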
```python
from transformers import RobertaTokenizerFast

# BPE tokenizer previously trained with the tokenizers library, as per the docs;
# vocab and merges are loaded through transformers' RobertaTokenizerFast
tokenizer = RobertaTokenizerFast.from_pretrained(
    "/path/to/BPE/tokenizer", return_special_tokens_mask=True, model_max_length=32
)

# Re-save the tokenizer (full set of tokenizer files now)
tokenizer.save_pretrained("/path/to/roberta_tk")
```
Saving outputs the following:

```python
('/path/to/roberta_tk/tokenizer_config.json',
 '/path/to/roberta_tk/special_tokens_map.json',
 '/path/to/roberta_tk/vocab.json',
 '/path/to/roberta_tk/merges.txt',
 '/path/to/roberta_tk/added_tokens.json',
 '/path/to/roberta_tk/tokenizer.json')
```
Note that there is no `config.json` file, only `tokenizer_config.json`.
Then try to load the tokenizer:
```python
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="/path/to/model",
    tokenizer="/path/to/roberta_tk",
)
```
This errors out, complaining that `config.json` is missing. Symlinking `tokenizer_config.json` to `config.json` works around the issue.
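Another workaround that avoids the symlink, as a sketch (same hypothetical paths as above): load the tokenizer explicitly and pass the object to `pipeline()`, so the pipeline does not have to resolve the tokenizer directory itself and the missing `config.json` should no longer matter.

```python
from transformers import RobertaTokenizerFast, pipeline

# Load the saved tokenizer directly instead of passing its path to pipeline()
tokenizer = RobertaTokenizerFast.from_pretrained("/path/to/roberta_tk")

fill_mask = pipeline(
    "fill-mask",
    model="/path/to/model",
    tokenizer=tokenizer,  # passing the object skips the path-based tokenizer lookup
)
```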
Expected behavior
The file names produced by `tokenizer.save_pretrained()` should match what the pipeline expects to load, so a saved tokenizer directory can be passed to `pipeline()` as-is.
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Version v4.8.0 on PyPi is indeed ok to use and should work perfectly well for the fill-mask pipeline. 😃
Gotcha, thank you!