
`AutoTokenizer` not enforcing `use_fast=True`

See original GitHub issue

This issue is about `AutoTokenizer` not enforcing `use_fast=True`.

This works:

$ python -c "from transformers import AutoTokenizer; t=AutoTokenizer.from_pretrained('facebook/opt-13b', use_fast=True); \
assert t.is_fast, 'tokenizer is not fast'; print('Success')" 
Success

Now the same code, but with a different model, `facebook/opt-1.3b`, that doesn’t have a fast tokenizer:

$ python -c "from transformers import AutoTokenizer; t=AutoTokenizer.from_pretrained('facebook/opt-1.3b', use_fast=True); \
assert t.is_fast, 'tokenizer is not fast'; print('Success')" 
Traceback (most recent call last):
  File "<string>", line 1, in <module>
AssertionError: tokenizer is not fast
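
To see which concrete class was actually returned (rather than just asserting), one can print it; the class name depends on the checkpoint:

$ python -c "from transformers import AutoTokenizer; t=AutoTokenizer.from_pretrained('facebook/opt-1.3b', use_fast=True); print(type(t).__name__)"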

Now the doc says:

use_fast (bool, optional, defaults to True) — Whether or not to try to load the fast version of the tokenizer.

So it sort of hints with “try to load” that it won’t enforce it. But would you be open to a less ambiguous definition? Something like:

use_fast (bool, optional, defaults to True) — Will try to load the fast version of the tokenizer if there is one, and
will quietly fall back to the normal (slower) tokenizer if the model doesn't provide a fast one.

I think the `use_fast` arg name is ambiguous - I’d have renamed it to `try_to_use_fast`, since currently, if one must use the fast tokenizer, one has to additionally check whether `AutoTokenizer.from_pretrained` returned the slow version.

Not sure, open to suggestions.

Context: in m4, the codebase currently requires a fast tokenizer.
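
In the meantime, the workaround is to check by hand what `from_pretrained` actually returned; a minimal sketch (the helper name `require_fast_tokenizer` is just for illustration):

from transformers import AutoTokenizer

def require_fast_tokenizer(model_name: str):
    # use_fast=True only *tries* to load a fast tokenizer,
    # so verify what actually came back
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
    if not tokenizer.is_fast:
        raise ValueError(
            f"{model_name} does not provide a fast tokenizer, "
            "but this codebase requires one"
        )
    return tokenizer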

Thank you!

cc: @ArthurZucker

Issue Analytics

  • State: open
  • Created: 9 months ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

2 reactions
ArthurZucker commented, Dec 20, 2022

It is indeed a bug: the facebook/opt-1.3b tokenizer config is missing the tokenizer_type variable, and the use_fast argument is not passed down properly in that case. The fix is in #20823
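
For anyone who wants to verify this locally, here is a minimal sketch for inspecting a checkpoint's tokenizer_config.json, assuming huggingface_hub is installed; which fields are present depends on the checkpoint:

import json
from huggingface_hub import hf_hub_download

# fetch just the tokenizer config of the checkpoint in question
path = hf_hub_download(repo_id="facebook/opt-1.3b", filename="tokenizer_config.json")
with open(path) as f:
    config = json.load(f)
# list which fields are declared (e.g. whether a tokenizer class/type is present)
print(sorted(config))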

1 reaction
ArthurZucker commented, Dec 20, 2022

Agreed, the problem is now the inconsistency between the two models. If it is only OPT-related we can leave it as is; otherwise I will have a look.
