`fill-mask` pipeline cannot load tokenizer's `config.json` (fixed in 4.8.0)
Environment info
- `transformers` version: 4.7.0
- Platform: Linux-5.4.0-74-generic-x86_64-with-glibc2.31
- Python version: 3.9.5
- PyTorch version (GPU?): 1.9.0+cu111 (True)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: No
Who can help
Information
Model I am using: RoBERTa
The problem arises when using:
- my own modified scripts: see details below

The task I am working on is:
- my own task or dataset: see details below
To reproduce
Following the official notebook to train RoBERTa from scratch (tokenizer and model alike). The only addition is re-saving the RoBERTa tokenizer after loading it from the trained BPE files (second snippet below).
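For context, the BPE tokenizer was trained beforehand with the `tokenizers` library, as per the docs. A minimal sketch of that step; the corpus file, vocabulary size, and output path are hypothetical:

```python
from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer on a plain-text corpus (hypothetical paths and sizes)
bpe_tokenizer = ByteLevelBPETokenizer()
bpe_tokenizer.train(
    files=["/path/to/corpus.txt"],
    vocab_size=52_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Writes vocab.json and merges.txt into the target directory
bpe_tokenizer.save_model("/path/to/BPE/tokenizer")
```

The resulting vocab and merges files are then loaded into `RobertaTokenizerFast` and the tokenizer is re-saved: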
```python
from transformers import RobertaTokenizerFast

# BPE tokenizer previously trained with the tokenizers library, as per the docs;
# vocab and merges are loaded through transformers' RobertaTokenizerFast
tokenizer = RobertaTokenizerFast.from_pretrained(
    "/path/to/BPE/tokenizer", return_special_tokens_mask=True, model_max_length=32
)

# Re-save the tokenizer (full set of tokenizer files now)
tokenizer.save_pretrained("/path/to/roberta_tk")
```
Saving outputs the following:

```python
('/path/to/roberta_tk/tokenizer_config.json',
 '/path/to/roberta_tk/special_tokens_map.json',
 '/path/to/roberta_tk/vocab.json',
 '/path/to/roberta_tk/merges.txt',
 '/path/to/roberta_tk/added_tokens.json',
 '/path/to/roberta_tk/tokenizer.json')
```
Note that there is no `config.json` file, only `tokenizer_config.json`.
Then try to load the tokenizer:
```python
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="/path/to/model",
    tokenizer="/path/to/roberta_tk",
)
```
This errors out, complaining that `config.json` is missing. Symlinking `tokenizer_config.json` to `config.json` works around the issue.
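Another workaround that avoids the symlink, as a sketch (same hypothetical paths as above): load the tokenizer explicitly and pass the object to `pipeline()`, so the pipeline does not have to resolve the tokenizer directory itself and the missing `config.json` should no longer matter.

```python
from transformers import RobertaTokenizerFast, pipeline

# Load the saved tokenizer directly instead of passing its path to pipeline()
tokenizer = RobertaTokenizerFast.from_pretrained("/path/to/roberta_tk")

fill_mask = pipeline(
    "fill-mask",
    model="/path/to/model",
    tokenizer=tokenizer,  # passing the object skips the path-based tokenizer lookup
)
```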
Expected behavior
The file names produced by `tokenizer.save_pretrained()` should match what the pipeline expects to load, so a saved tokenizer directory can be passed to `pipeline()` as-is.
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Version v4.8.0 on PyPi is indeed ok to use and should work perfectly well for the fill-mask pipeline. 😃
Gotcha, thank you!