`AutoTokenizer` not enforcing `use_fast=True`
This issue is about `AutoTokenizer` not enforcing `use_fast=True`.
This works:

```bash
$ python -c "from transformers import AutoTokenizer; t=AutoTokenizer.from_pretrained('facebook/opt-13b', use_fast=True); \
assert t.is_fast, 'tokenizer is not fast'; print('Success')"
Success
```
Now the same code, but with a different model, `facebook/opt-1.3b`, that doesn't have a fast tokenizer:
```bash
$ python -c "from transformers import AutoTokenizer; t=AutoTokenizer.from_pretrained('facebook/opt-1.3b', use_fast=True); \
assert t.is_fast, 'tokenizer is not fast'; print('Success')"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
AssertionError: tokenizer is not fast
```
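For reference, the call itself does not fail; it quietly returns a slow tokenizer object. A quick way to see the fallback (a sketch; the concrete class name depends on the model's config):

```python
from transformers import AutoTokenizer

# use_fast=True is silently ignored when no fast tokenizer is available
t = AutoTokenizer.from_pretrained("facebook/opt-1.3b", use_fast=True)
print(type(t).__name__)  # a slow (pure-Python) tokenizer class
print(t.is_fast)         # False
```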
Now the docs say:
> `use_fast` (`bool`, *optional*, defaults to `True`) — Whether or not to try to load the fast version of the tokenizer.
So it sort of hints, with "try to load", that it won't enforce it. But would you be open to a less ambiguous definition? Something like:
> `use_fast` (`bool`, *optional*, defaults to `True`) — Will try to load the fast version of the tokenizer if there is one, and will quietly fall back onto the normal (slower) tokenizer if the model doesn't provide a fast one.
I also think the `use_fast` arg name is ambiguous; I'd have renamed it to `try_to_use_fast`, since currently, if one must use the fast tokenizer, one has to additionally check whether `AutoTokenizer.from_pretrained` returned the slow version. Not sure, open to suggestions.
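In the meantime, callers that must have a fast tokenizer need to do that check by hand. A minimal sketch of such a wrapper (the helper name `require_fast_tokenizer` is made up, not part of transformers):

```python
from transformers import AutoTokenizer, PreTrainedTokenizerFast

def require_fast_tokenizer(model_name: str, **kwargs) -> PreTrainedTokenizerFast:
    """Load a tokenizer and fail loudly if only a slow one is available."""
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True, **kwargs)
    if not tokenizer.is_fast:
        # from_pretrained silently fell back to the slow tokenizer
        raise ValueError(f"{model_name} does not provide a fast tokenizer")
    return tokenizer
```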
Context: in m4 the codebase currently requires a fast tokenizer.
Thank you!
cc: @ArthurZucker
Top GitHub Comments
It is indeed a bug: the `facebook/opt-1.3b` tokenizer config is missing the `tokenizer_type` variable, and the `use_fast` argument is not passed down properly in that case. The fix is here: #20823

Agreed, the problem is now the inconsistency between the two models. If it is only `OPT`-related we can leave it as is, otherwise will have a look.
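For anyone who wants to verify the config difference themselves, something like this works (a sketch using `huggingface_hub`; which fields are present varies between repos):

```python
import json
from huggingface_hub import hf_hub_download

# Compare the tokenizer configs of the two OPT checkpoints; the issue is
# that the relevant type/class field differs between them.
for repo in ("facebook/opt-13b", "facebook/opt-1.3b"):
    path = hf_hub_download(repo_id=repo, filename="tokenizer_config.json")
    with open(path) as f:
        config = json.load(f)
    print(repo, "->", {k: v for k, v in config.items() if "class" in k or "type" in k})
```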