Some community models are broken and can't be downloaded
Bug
Information
Model I am using (Bert, XLNet …): Community Models
Language I am using the model on (English, Chinese …): Multiple different ones
Quite a few community models can't be loaded. The stats are here:
Stats
- 68 can't load either their config or their tokenizer:
  - a) 34 models can't even load their config file. The reasons for this are either:
    - i. 11/34: the model identifier is wrong, e.g. `albert-large` does not exist anymore; it seems it was renamed to `albert-large-v1`. These models are saved under a different name online than on AWS.
    - ii. 23/34: there is an unrecognized `model_type` in the config.json, e.g.:
      > Error: Message: Unrecognized model in hfl/rbtl3. Should have a `model_type` key in its config.json, or contain one of the following strings in its name: t5, distilbert, albert, camembert, xlm-roberta, bart, roberta, flaubert, bert, openai-gpt, gpt2, transfo-xl, xlnet, xlm, ctrl
  - b) 33 models can load their config, but cannot load their tokenizers. The error message is almost always the same:
    > TOK ERROR: clue/roberta_chinese_base tokenizer can not be loaded. Message: Model name 'clue/roberta_chinese_base' was not found in tokenizers model name list (roberta-base, roberta-large, roberta-large-mnli, distilroberta-base, roberta-base-openai-detector, roberta-large-openai-detector). We assumed 'clue/roberta_chinese_base' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.json', 'merges.txt'] but couldn't find such vocabulary files at this path or url.
    - i. Here the model has none of: `vocab_file`, `added_tokens_file`, `special_tokens_map_file`, `tokenizer_config_file`.
- 79 currently have a wrong `pad_token_id`, `eos_token_id`, or `bos_token_id` in their configs. IMPORTANT: the reason for this is that we used to have the wrong defaults saved in `PretrainedConfig()` - see e.g. here: the default value of `pad_token_id` for any model was 0. People trained a model with the lib and saved it, and the resulting config.json then had `pad_token_id = 0` saved. This was then uploaded. But it's wrong and should be corrected.
- For 162 models everything is fine!
The full analysis log is here. The code that created this log (a simple comparison of the loaded tokenizer and config with the default config) is here.
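A minimal sketch of the comparison described above, on plain config dicts rather than real `PretrainedConfig` objects; the `DEFAULTS` values and the sample config are illustrative stand-ins, not the library's actual defaults:

```python
# Flag models whose saved special-token ids differ from the library defaults.
# DEFAULTS is a stand-in for the (corrected) PretrainedConfig defaults.
DEFAULTS = {"pad_token_id": None, "eos_token_id": None, "bos_token_id": None}

def suspicious_token_ids(config: dict) -> dict:
    """Return the token-id entries in `config` that differ from the defaults."""
    return {
        key: config[key]
        for key in DEFAULTS
        if key in config and config[key] != DEFAULTS[key]
    }

# A community config that still carries the old wrong default pad_token_id = 0:
uploaded = {"model_type": "bert", "pad_token_id": 0}
print(suspicious_token_ids(uploaded))  # {'pad_token_id': 0}
```

Running this over every downloaded config.json would reproduce the kind of per-model report the log above is based on.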
HOW-TO-FIX-STEPS (in the following order):
- Fix 1 a) i. first: all models that have a wrong model identifier path should get the correct one. Some model identifier paths on https://huggingface.co/models need to be updated, e.g. changing `bertabs-finetuned-xsum-extractive-abstractive-summarization` to `remi/bertabs-finetuned-xsum-extractive-abstractive-summarization`. Some of those errors are very weird, see #3358.
- Fix 1 a) ii. should be quite easy: add the correct `model_type` to the config.json.
- Fix 1 b) Not sure how to fix the missing tokenizer files most efficiently @julien-c
- Fix 2) Create an automated script that:
  - If `tokenizer.pad_token_id != default_config.pad_token_id` -> sets `config.pad_token_id = tokenizer.pad_token_id`; else removes `pad_token_id`.
  - Removes all `eos_token_ids` -> they don't exist anymore
  - Removes all
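The Fix 2 steps above could be sketched like this, again on a plain config dict; the function name and the sample values are hypothetical, and a real script would read the tokenizer's `pad_token_id` from the loaded tokenizer:

```python
def fix_config(config: dict, tokenizer_pad_token_id, default_pad_token_id):
    """Apply the Fix 2 steps to a config dict loaded from a config.json."""
    fixed = dict(config)
    # If the tokenizer disagrees with the default, trust the tokenizer;
    # otherwise drop the redundant (possibly wrong) saved value.
    if tokenizer_pad_token_id != default_pad_token_id:
        fixed["pad_token_id"] = tokenizer_pad_token_id
    else:
        fixed.pop("pad_token_id", None)
    # eos_token_ids no longer exists as a config field -> always remove it.
    fixed.pop("eos_token_ids", None)
    return fixed

cfg = {"model_type": "bert", "pad_token_id": 0, "eos_token_ids": [2]}
print(fix_config(cfg, tokenizer_pad_token_id=1, default_pad_token_id=None))
# {'model_type': 'bert', 'pad_token_id': 1}
```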
Issue Analytics
- State:
- Created: 4 years ago
- Reactions: 5
- Comments: 6 (3 by maintainers)
When I used the ernie model pretrained by BaiDu, I had the same problem. My solution was to add "model_type": "bert" to the configuration file. It worked, but I don't know if it's reasonable.
Hi, @XiangQinYu. I'm a bit of a newbie with Huggingface. Can you say more about how you did this? I guess you mean adding "model_type": "bert" to a file like this. But how did you edit the file? Did you download the whole model repository, and edit and run it locally?
EDIT: Nevermind, figured it out with help of a commenter on a question I asked on SO.
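For anyone landing here with the same question: the edit itself is just a JSON change to the downloaded config.json. A minimal sketch (the file path and config contents below are made up for illustration; in practice you download the model files from its page on huggingface.co and then load the model from that local directory):

```python
import json
import os
import tempfile

# Simulate a downloaded config.json that is missing "model_type".
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "config.json")
with open(path, "w") as f:
    json.dump({"hidden_size": 768, "num_attention_heads": 12}, f)

# The workaround from the comment above: add "model_type": "bert".
with open(path) as f:
    config = json.load(f)
config["model_type"] = "bert"
with open(path, "w") as f:
    json.dump(config, f, indent=2)

with open(path) as f:
    print(json.load(f)["model_type"])  # bert
```

After the edit, pointing the library at the local directory (instead of the remote model identifier) picks up the patched config.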