[Tokenizers] Mismatch between fast and slow
When I worked on the implementation of Whisper I realised that two different behaviors appear when you use a fast or slow tokenizer and have an OOV (out-of-vocabulary) index.
Simple snippet :
>>> from transformers import GPT2Tokenizer, GPT2TokenizerFast
>>> fast = GPT2TokenizerFast.from_pretrained("gpt2")
>>> slow = GPT2Tokenizer.from_pretrained("gpt2")
>>> # the vocab size is 50257
>>> fast.decode(50258)
''
>>> slow.decode(50258)
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/arthur_huggingface_co/transformers/src/transformers/tokenization_utils_base.py", line 3468, in decode
    return self._decode(
  File "/home/arthur_huggingface_co/transformers/src/transformers/tokenization_utils.py", line 938, in _decode
    for token in filtered_tokens:
TypeError: 'NoneType' object is not iterable
My question I guess is : which one is the expected one? Here is my take :
- It should work, but output a warning saying that an OOV was encountered and was ignored.
WDYT @sgugger @LysandreJik @Narsil
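To make the two candidate behaviors concrete, here is a minimal sketch in plain Python. It does not use or reproduce the transformers code: `ID_TO_TOKEN`, `decode_ids`, and the `on_oov` parameter are hypothetical names invented for illustration, and the three-entry vocabulary is toy data. It only shows what "ignore with a warning" versus "raise an exception" could look like for an out-of-vocabulary id.

```python
import warnings

# Toy id -> token table standing in for a real vocabulary (hypothetical data).
ID_TO_TOKEN = {0: "Hello", 1: ",", 2: " world"}


def decode_ids(ids, id_to_token=ID_TO_TOKEN, on_oov="warn"):
    """Decode ids to text, handling out-of-vocabulary ids explicitly.

    on_oov="warn":  skip unknown ids and emit a warning.
    on_oov="raise": raise an IndexError on the first unknown id.
    """
    if on_oov not in ("warn", "raise"):
        raise ValueError("on_oov must be 'warn' or 'raise'")
    if isinstance(ids, int):
        # Accept a single id as well as a sequence of ids.
        ids = [ids]
    tokens = []
    for index in ids:
        token = id_to_token.get(index)  # None when the id is unknown
        if token is None:
            if on_oov == "raise":
                raise IndexError(f"id {index} is out of vocabulary")
            warnings.warn(f"id {index} is out of vocabulary and was ignored")
            continue
        tokens.append(token)
    return "".join(tokens)
```

With `on_oov="warn"`, `decode_ids(5)` returns an empty string and emits a warning, which matches the take above. With `on_oov="raise"`, the same call fails loudly instead of silently dropping the id.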
Top GitHub Comments
Yes, an exception should be raised and it’s more of a bug fix than a breaking change IMO. Users will be surprised, but they should be surprised when there is an out-of-vocab index.
We can, but we would also have to help the OpenAI team with their tokenizer, which is based on our GPT2TokenizerFast 😅