
[Tokenizers] Mismatch between fast and slow


While working on the Whisper implementation, I realised that the fast and slow tokenizers behave differently when you decode an out-of-vocabulary (OOV) token id. Simple snippet:

>>> from transformers import GPT2Tokenizer, GPT2TokenizerFast
>>> fast = GPT2TokenizerFast.from_pretrained("gpt2")
>>> slow = GPT2Tokenizer.from_pretrained("gpt2")
>>> # the vocab size is 50257
>>> fast.decode(50258)
''
>>> slow.decode(50258)
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/arthur_huggingface_co/transformers/src/transformers/tokenization_utils_base.py", line 3468, in decode
    return self._decode(
  File "/home/arthur_huggingface_co/transformers/src/transformers/tokenization_utils.py", line 938, in _decode
    for token in filtered_tokens:
TypeError: 'NoneType' object is not iterable
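For what it's worth, the traceback hints at the mechanism (at least in the version shown): the slow GPT2Tokenizer resolves ids with a plain dict lookup (self.decoder.get(index)), so an OOV id silently maps to None, and _decode then fails trying to iterate over it. Continuing the session above:

>>> slow.convert_ids_to_tokens(50258) is None  # OOV id silently becomes None
True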

My question, I guess, is: which behavior is the expected one? Here is my take:

  • It should work, but output a warning saying that an OOV id was encountered and ignored.

WDYT @sgugger @LysandreJik @Narsil
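For illustration, a minimal sketch of that warn-and-ignore option. safe_decode is a hypothetical helper, not part of transformers, and it approximates "in vocab" as any id below len(tokenizer), which also covers added tokens:

import warnings

def safe_decode(tokenizer, token_ids, **kwargs):
    # Hypothetical helper: drop OOV ids with a warning instead of
    # crashing (slow) or silently returning '' (fast).
    if isinstance(token_ids, int):
        token_ids = [token_ids]
    oov = [i for i in token_ids if not 0 <= i < len(tokenizer)]
    if oov:
        warnings.warn(f"Ignoring out-of-vocab token ids: {oov}")
    return tokenizer.decode(
        [i for i in token_ids if 0 <= i < len(tokenizer)], **kwargs
    )

With this sketch, safe_decode(slow, 50258) would warn and return '', matching what the fast tokenizer does today.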

Issue Analytics

  • State: open
  • Created: 9 months ago
  • Reactions: 1
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

sgugger commented, Dec 15, 2022 (1 reaction)

Yes, an exception should be raised and it’s more of a bug fix than a breaking change IMO. Users will be surprised, but they should be surprised when there is an out-of-vocab index.
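A sketch of that stricter behavior, for contrast with the warn-and-ignore helper above (hypothetical code, not the actual patch):

def decode_strict(tokenizer, token_ids, **kwargs):
    # Hypothetical: reject OOV ids up front instead of failing later
    # with an unrelated TypeError.
    ids = [token_ids] if isinstance(token_ids, int) else list(token_ids)
    oov = [i for i in ids if not 0 <= i < len(tokenizer)]
    if oov:
        raise ValueError(f"Out-of-vocabulary token ids: {oov}")
    return tokenizer.decode(ids, **kwargs)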

ArthurZucker commented, Dec 20, 2022 (0 reactions)

We can, but we would also have to help the OpenAI team with their tokenizer, which is based on our GPT2TokenizerFast 😅
