
`AutoTokenizer` not enforcing `use_fast=True`

See original GitHub issue

This issue is about `AutoTokenizer` not enforcing `use_fast=True`.

This works:

$ python -c "from transformers import AutoTokenizer; t=AutoTokenizer.from_pretrained('facebook/opt-13b', use_fast=True); \
assert t.is_fast, 'tokenizer is not fast'; print('Success')" 
Success

Now the same code, but with a different model, `facebook/opt-1.3b`, that doesn’t have a fast tokenizer:

$ python -c "from transformers import AutoTokenizer; t=AutoTokenizer.from_pretrained('facebook/opt-1.3b', use_fast=True); \
assert t.is_fast, 'tokenizer is not fast'; print('Success')" 
Traceback (most recent call last):
  File "<string>", line 1, in <module>
AssertionError: tokenizer is not fast
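
To see which concrete class was actually returned (rather than just asserting), one can print it; the class name depends on the checkpoint:

$ python -c "from transformers import AutoTokenizer; t=AutoTokenizer.from_pretrained('facebook/opt-1.3b', use_fast=True); print(type(t).__name__)"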

Now the doc says:

use_fast (bool, optional, defaults to True) — Whether or not to try to load the fast version of the tokenizer.

So it sort of hints with “try to load” that it won’t enforce it. But would you be open to a less ambiguous definition? Something like:

use_fast (bool, optional, defaults to True) — Will try to load the fast version of the tokenizer if there is one, and
will quietly fall back to the normal (slower) tokenizer if the model doesn't provide a fast one.

I think the `use_fast` arg name is ambiguous - I’d have renamed it to `try_to_use_fast`, since currently, if one must use the fast tokenizer, one has to additionally check whether `AutoTokenizer.from_pretrained` returned the slow version.

Not sure, open to suggestions.

Context: in m4, the codebase currently requires a fast tokenizer.
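
In the meantime, the workaround is to check by hand what `from_pretrained` actually returned; a minimal sketch (the helper name `require_fast_tokenizer` is just for illustration):

from transformers import AutoTokenizer

def require_fast_tokenizer(model_name: str):
    # use_fast=True only *tries* to load a fast tokenizer,
    # so verify what actually came back
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
    if not tokenizer.is_fast:
        raise ValueError(
            f"{model_name} does not provide a fast tokenizer, "
            "but this codebase requires one"
        )
    return tokenizer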

Thank you!

cc: @ArthurZucker

Issue Analytics

  • State: open
  • Created: 9 months ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

2 reactions
ArthurZucker commented, Dec 20, 2022

It is indeed a bug: the facebook/opt-1.3b tokenizer config is missing the tokenizer_type variable, and the use_fast argument is not passed down properly in that case. The fix is in #20823
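
For anyone who wants to verify this locally, here is a minimal sketch for inspecting a checkpoint's tokenizer_config.json, assuming huggingface_hub is installed; which fields are present depends on the checkpoint:

import json
from huggingface_hub import hf_hub_download

# fetch just the tokenizer config of the checkpoint in question
path = hf_hub_download(repo_id="facebook/opt-1.3b", filename="tokenizer_config.json")
with open(path) as f:
    config = json.load(f)
# list which fields are declared (e.g. whether a tokenizer class/type is present)
print(sorted(config))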

1 reaction
ArthurZucker commented, Dec 20, 2022

Agreed, the problem is now the inconsistency between the two models. If it is only OPT-related we can leave it as is; otherwise I will have a look.
