
`fill-mask` pipeline cannot load tokenizer's `config.json` (fixed in 4.8.0)

See original GitHub issue

Environment info

  • transformers version: 4.7.0
  • Platform: Linux-5.4.0-74-generic-x86_64-with-glibc2.31
  • Python version: 3.9.5
  • PyTorch version (GPU?): 1.9.0+cu111 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Who can help

@sgugger @LysandreJik

Information

Model I am using: RoBERTa

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: see details below

The task I am working on is:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: see details below

To reproduce

Following the official notebook to train RoBERTa from scratch (tokenizer and model alike). The only addition is saving the RoBERTa tokenizer:

tokenizer = RobertaTokenizerFast.from_pretrained("/path/to/BPE/tokenizer", return_special_tokens_mask=True, model_max_length=32)  # BPE tokenizer previously trained with the tokenizers library, as per the docs; vocab and merges then loaded into transformers' RobertaTokenizerFast

tokenizer.save_pretrained("/path/to/roberta_tk")  # resaving the tokenizer, full model now

Saving outputs the following:

('/path/to/roberta_tk/tokenizer_config.json',
 '/path/to/roberta_tk/special_tokens_map.json',
 '/path/to/roberta_tk/vocab.json',
 '/path/to/roberta_tk/merges.txt',
 '/path/to/roberta_tk/added_tokens.json',
 '/path/to/roberta_tk/tokenizer.json')

Note that there is no config.json file, only tokenizer_config.json.
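The save step above can be reproduced end-to-end with a small, self-contained sketch (the toy corpus, directory names, and hyperparameters are illustrative, not from the original report): train a tiny byte-level BPE tokenizer with the tokenizers library, reload it through RobertaTokenizerFast, and re-save it. The resulting file list contains tokenizer_config.json but no config.json, which is what triggers the bug.

```python
import os
import tempfile

from tokenizers import ByteLevelBPETokenizer
from transformers import RobertaTokenizerFast

work = tempfile.mkdtemp()

# Toy in-memory corpus standing in for the real training data.
corpus = os.path.join(work, "corpus.txt")
with open(corpus, "w") as f:
    f.write("Hello world.\nThe quick brown fox jumps over the lazy dog.\n")

# Train a small byte-level BPE tokenizer, as in the official notebook.
bpe = ByteLevelBPETokenizer()
bpe.train(
    files=[corpus],
    vocab_size=500,
    min_frequency=1,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
bpe_dir = os.path.join(work, "bpe")
os.makedirs(bpe_dir, exist_ok=True)
bpe.save_model(bpe_dir)  # writes vocab.json and merges.txt

# Reload through transformers, then re-save the full tokenizer.
tokenizer = RobertaTokenizerFast.from_pretrained(bpe_dir, model_max_length=32)
out_dir = os.path.join(work, "roberta_tk")
saved = tokenizer.save_pretrained(out_dir)

# The saved files include tokenizer_config.json, but no config.json.
print(sorted(os.path.basename(p) for p in saved))
```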

Then try to load the tokenizer:

fill_mask = pipeline(
    "fill-mask",
    model="/path/to/model",
    tokenizer="/path/to/roberta_tk"
)

This errors out, complaining that config.json is missing. Symlinking tokenizer_config.json to config.json works around the issue.
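Besides the symlink, another workaround on affected versions (before 4.8.0) is to load the tokenizer explicitly and hand the object, rather than a path string, to pipeline(), so the pipeline never attempts to read a config.json from the tokenizer directory. A hedged sketch (the helper name and directory arguments are illustrative):

```python
from transformers import RobertaTokenizerFast, pipeline


def build_fill_mask(model_dir: str, tokenizer_dir: str):
    """Build a fill-mask pipeline from local directories, sidestepping the
    config.json lookup by passing the tokenizer object directly."""
    tokenizer = RobertaTokenizerFast.from_pretrained(tokenizer_dir)
    return pipeline("fill-mask", model=model_dir, tokenizer=tokenizer)
```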

Expected behavior

The file names written by tokenizer.save_pretrained should match the file names the pipeline expects to load.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 13 (6 by maintainers)

Top GitHub Comments

1 reaction
LysandreJik commented, Jun 24, 2021

Version v4.8.0 on PyPI is indeed OK to use and should work perfectly well for the fill-mask pipeline. 😃

1 reaction
rspreafico-absci commented, Jun 23, 2021

Gotcha, thank you!


Top Results From Across the Web

Pipeline fill-mask error with custom Roberta tokenizer
I think it's looking for a config.json file in the tokenizer folder but the BPE tokenizer is only outputting vocab.json and merges.txt files....
Building KantaiBERT from scratch - Kaggle
Loading the Trained Tokenizer Files from tokenizers.implementations import ... We won't load it. loading file . ... content/KantaiBERT/config.json not found.
pip install transformers==2.5.1 - PyPI
Quick tour: pipelines, Using Pipelines: Wrapper around tokenizer and models ... config.json [--filename folder/foobar.json] # ^^ Upload a single file # (you ...
Transformers v4.0.0: Fast tokenizers, model outputs, file ...
AutoTokenizers and pipelines now use fast (rust) tokenizers by default. The python and rust tokenizers have roughly the same API, ...
