broken models on the hub
Go to https://huggingface.co/sshleifer/distill-mbart-en-ro-12-6, click on “use in transformers”, copy-and-paste, and nope, you can’t use this in transformers:
```
python -c 'from transformers import AutoTokenizer; AutoTokenizer.from_pretrained("sshleifer/distill-mbart-en-ro-12-6")'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/mnt/nvme1/code/huggingface/transformers-master/src/transformers/models/auto/tokenization_auto.py", line 410, in from_pretrained
    return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/mnt/nvme1/code/huggingface/transformers-master/src/transformers/tokenization_utils_base.py", line 1704, in from_pretrained
    return cls._from_pretrained(
  File "/mnt/nvme1/code/huggingface/transformers-master/src/transformers/tokenization_utils_base.py", line 1717, in _from_pretrained
    slow_tokenizer = (cls.slow_tokenizer_class)._from_pretrained(
  File "/mnt/nvme1/code/huggingface/transformers-master/src/transformers/tokenization_utils_base.py", line 1776, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/mnt/nvme1/code/huggingface/transformers-master/src/transformers/models/roberta/tokenization_roberta.py", line 159, in __init__
    super().__init__(
  File "/mnt/nvme1/code/huggingface/transformers-master/src/transformers/models/gpt2/tokenization_gpt2.py", line 179, in __init__
    with open(vocab_file, encoding="utf-8") as vocab_handle:
TypeError: expected str, bytes or os.PathLike object, not NoneType
```
This is with the latest master.
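For reference, the traceback goes through tokenization_roberta.py and then tokenization_gpt2.py, so AutoTokenizer resolved this repo to a RoBERTa-style (BPE) tokenizer, which expects a vocab.json; when that file can’t be found, vocab_file stays None and open(None) raises the TypeError. A quick way to see what the repo actually ships is to list its files - a sketch using today’s huggingface_hub API, which may differ from what existed when this issue was filed:

```python
# Diagnosis sketch (hypothetical): list the files in the hub repo to check
# whether the tokenizer files the config points at actually exist.
from huggingface_hub import list_repo_files

files = list_repo_files("sshleifer/distill-mbart-en-ro-12-6")
print("\n".join(sorted(files)))
# If vocab.json / merges.txt are missing while the config maps the repo to a
# BPE tokenizer, from_pretrained ends up calling open(None), as in the traceback.
```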
These, for example, I tested and they work fine:
sshleifer/distill-mbart-en-ro-12-4
sshleifer/distill-mbart-en-ro-12-9
Perhaps we need some sort of CI that goes over the public models, validates that running them in transformers succeeds, and sends an alert if it doesn’t? We have no idea how many other models are broken on the hub right now.
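A minimal sketch of what such a validation job could look like, assuming huggingface_hub’s list_models API; the author filter and the report format are illustrative only, not a proposal for the real implementation:

```python
# Sketch only: walk a slice of the hub, attempt a cheap load, collect failures.
from huggingface_hub import list_models
from transformers import AutoConfig, AutoTokenizer

failures = []
# Illustrative filter; a real job would paginate over all public models.
for info in list_models(author="sshleifer"):
    model_id = info.id
    try:
        # Config + tokenizer are cheap to fetch compared to the weights,
        # yet already catch errors like the one in this issue.
        AutoConfig.from_pretrained(model_id)
        AutoTokenizer.from_pretrained(model_id)
    except Exception as err:
        failures.append((model_id, repr(err)))

for model_id, err in failures:
    print(f"BROKEN: {model_id}: {err}")  # a real CI would send an alert here
```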
Issue Analytics
- Created: 3 years ago
- Reactions: 1
- Comments: 10 (9 by maintainers)
Top GitHub Comments
Note that this is not necessarily low-hanging fruit (depending on your definition of low-hanging fruit 😂), given that the hub is huge and the models on it keep changing.
I meant that just loading a model/tokenizer is cheaper, faster, and requires almost zero extra code to write - hence low-hanging fruit.
I hear you that the hub is huge - a little bit at a time. It would be the same code whether it validates 10 models or 7K models; if there is no urgency to complete it fast, it would just take much, much longer to finish.
That was exactly my point: the models change and the codebase does too, so it’s not enough to check once, even if we track when each model was last changed and when it was last validated.
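A rough sketch of that bookkeeping, assuming huggingface_hub’s model_info API (attribute naming varies across huggingface_hub versions; recent ones expose last_modified) and a hypothetical stored validation record:

```python
# Hypothetical re-validation check: a model needs re-checking if the repo or
# the installed transformers version changed since the last successful run.
from typing import Optional

from huggingface_hub import HfApi
import transformers

def needs_revalidation(model_id: str, record: Optional[dict]) -> bool:
    info = HfApi().model_info(model_id)
    if record is None:  # never validated before
        return True
    return (
        str(info.last_modified) != record["last_modified"]  # model changed
        or transformers.__version__ != record["transformers_version"]  # codebase changed
    )
```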