Some community models are broken and can't be downloaded
Bug
Information
Model I am using (Bert, XLNet …): Community Models
Language I am using the model on (English, Chinese …): Multiple different ones
Quite a few community models can't be loaded. The stats are here:
Stats
- 68 can't load either their config or their tokenizer:
  - a) 34 models can't even load their config file. The reasons for this are either:
    - i. 11/34: the model identifier is wrong, e.g. `albert-large` does not exist anymore; it seems it was renamed to `albert-large-v1`. These models are saved under a different name online than on AWS.
    - ii. 23/34: there is an unrecognized `model_type` in the config.json, e.g.:
      > Error: Message: Unrecognized model in hfl/rbtl3. Should have a `model_type` key in its config.json, or contain one of the following strings in its name: t5, distilbert, albert, camembert, xlm-roberta, bart, roberta, flaubert, bert, openai-gpt, gpt2, transfo-xl, xlnet, xlm, ctrl
  - b) 33 models can load their config, but cannot load their tokenizers. The error message is almost always the same:
    > TOK ERROR: clue/roberta_chinese_base tokenizer can not be loaded. Message: Model name 'clue/roberta_chinese_base' was not found in tokenizers model name list (roberta-base, roberta-large, roberta-large-mnli, distilroberta-base, roberta-base-openai-detector, roberta-large-openai-detector). We assumed 'clue/roberta_chinese_base' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.json', 'merges.txt'] but couldn't find such vocabulary files at this path or url.
    - i. Here the model has none of: `vocab_file`, `added_tokens_file`, `special_tokens_map_file`, `tokenizer_config_file`.
- 79 currently have a wrong `pad_token_id`, `eos_token_id`, or `bos_token_id` in their configs. IMPORTANT: the reason for this is that we used to have the wrong defaults saved in `PretrainedConfig()` - see e.g. here: the default value of `pad_token_id` for any model was 0. People trained a model with the lib and saved it, and the resulting config.json then had `pad_token_id = 0` saved. This was then uploaded. But it's wrong and should be corrected.
- For 162 models everything is fine!
The full analysis log is here. The code that created this log (a simple comparison of the loaded tokenizer and config with the default config) is here.
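A minimal sketch of the comparison described above, on plain config dicts rather than real `PretrainedConfig` objects; the `DEFAULTS` values and the sample config are illustrative stand-ins, not the library's actual defaults:

```python
# Flag models whose saved special-token ids differ from the library defaults.
# DEFAULTS is a stand-in for the (corrected) PretrainedConfig defaults.
DEFAULTS = {"pad_token_id": None, "eos_token_id": None, "bos_token_id": None}

def suspicious_token_ids(config: dict) -> dict:
    """Return the token-id entries in `config` that differ from the defaults."""
    return {
        key: config[key]
        for key in DEFAULTS
        if key in config and config[key] != DEFAULTS[key]
    }

# A community config that still carries the old wrong default pad_token_id = 0:
uploaded = {"model_type": "bert", "pad_token_id": 0}
print(suspicious_token_ids(uploaded))  # {'pad_token_id': 0}
```

Running this over every downloaded config.json would reproduce the kind of per-model report the log above is based on.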
HOW-TO-FIX-STEPS (in the following order):
- Fix 1 a) i. first: all models that have a wrong model identifier path should get the correct one. Some model identifier paths on https://huggingface.co/models need to be updated, e.g. changing `bertabs-finetuned-xsum-extractive-abstractive-summarization` to `remi/bertabs-finetuned-xsum-extractive-abstractive-summarization`. Some of those errors are very weird, see #3358.
- Fix 1 a) ii. should be quite easy: add the correct `model_type` to the config.json.
- Fix 1 b) Not sure how to fix the missing tokenizer files most efficiently @julien-c
- Fix 2) Create an automated script that:
  - If `tokenizer.pad_token_id != default_config.pad_token_id` -> sets `config.pad_token_id = tokenizer.pad_token_id`; else removes `pad_token_id`.
  - Removes all `eos_token_ids` -> they don't exist anymore
  - Removes all
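The Fix 2 steps above could be sketched like this, again on a plain config dict; the function name and the sample values are hypothetical, and a real script would read the tokenizer's `pad_token_id` from the loaded tokenizer:

```python
def fix_config(config: dict, tokenizer_pad_token_id, default_pad_token_id):
    """Apply the Fix 2 steps to a config dict loaded from a config.json."""
    fixed = dict(config)
    # If the tokenizer disagrees with the default, trust the tokenizer;
    # otherwise drop the redundant (possibly wrong) saved value.
    if tokenizer_pad_token_id != default_pad_token_id:
        fixed["pad_token_id"] = tokenizer_pad_token_id
    else:
        fixed.pop("pad_token_id", None)
    # eos_token_ids no longer exists as a config field -> always remove it.
    fixed.pop("eos_token_ids", None)
    return fixed

cfg = {"model_type": "bert", "pad_token_id": 0, "eos_token_ids": [2]}
print(fix_config(cfg, tokenizer_pad_token_id=1, default_pad_token_id=None))
# {'model_type': 'bert', 'pad_token_id': 1}
```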
Issue Analytics
- State:
- Created: 4 years ago
- Reactions: 5
- Comments: 6 (3 by maintainers)
When I used the ernie model pretrained by BaiDu, I had the same problem. My solution was to add "model_type": "bert" to the configuration file. It worked, but I don't know if it's reasonable.
Hi, @XiangQinYu. I'm a bit of a newbie with Huggingface. Can you say more about how you did this? I guess you mean adding "model_type": "bert" to a file like this. But how did you edit the file? Did you download the whole model repository, and edit and run it locally?
EDIT: Nevermind, figured it out with help of a commenter on a question I asked on SO.
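For anyone landing here with the same question: the edit itself is just a JSON change to the downloaded config.json. A minimal sketch (the file path and config contents below are made up for illustration; in practice you download the model files from its page on huggingface.co and then load the model from that local directory):

```python
import json
import os
import tempfile

# Simulate a downloaded config.json that is missing "model_type".
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "config.json")
with open(path, "w") as f:
    json.dump({"hidden_size": 768, "num_attention_heads": 12}, f)

# The workaround from the comment above: add "model_type": "bert".
with open(path) as f:
    config = json.load(f)
config["model_type"] = "bert"
with open(path, "w") as f:
    json.dump(config, f, indent=2)

with open(path) as f:
    print(json.load(f)["model_type"])  # bert
```

After the edit, pointing the library at the local directory (instead of the remote model identifier) picks up the patched config.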