[Tokenizers] Mismatch between fast and slow

While working on the Whisper implementation, I realised that the fast and slow tokenizers behave differently when asked to decode an out-of-vocabulary (OOV) token id. A simple snippet:

>>> from transformers import GPT2Tokenizer, GPT2TokenizerFast
>>> fast = GPT2TokenizerFast.from_pretrained("gpt2")
>>> slow = GPT2Tokenizer.from_pretrained("gpt2")
>>> # the vocab size is 50257
>>> fast.decode(50258)
''

>>> slow.decode(50258)
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/arthur_huggingface_co/transformers/src/transformers/tokenization_utils_base.py", line 3468, in decode
    return self._decode(
  File "/home/arthur_huggingface_co/transformers/src/transformers/tokenization_utils.py", line 938, in _decode
    for token in filtered_tokens:
TypeError: 'NoneType' object is not iterable

My question, I guess, is: which of the two behaviors is the expected one? Here is my take:

It should work, but output a warning saying that an OOV was encountered and was ignored.
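To make the proposal concrete, here is a minimal sketch of the "warn and skip" behavior. It does not use the transformers API: `TOY_VOCAB` and `decode_lenient` are hypothetical names, and the three-entry vocabulary only stands in for a real id-to-token table.

```python
import warnings

# Hypothetical toy vocabulary standing in for a tokenizer's id -> token table.
TOY_VOCAB = {0: "Hello", 1: ",", 2: " world"}


def decode_lenient(token_ids, vocab=TOY_VOCAB):
    """Decode ids to text, skipping ids that are missing from the vocabulary.

    A single warning listing the skipped ids is emitted if any were found.
    """
    # Accept a bare int as well as a sequence, like the snippet above does.
    if isinstance(token_ids, int):
        token_ids = [token_ids]

    pieces, unknown = [], []
    for idx in token_ids:
        token = vocab.get(idx)
        if token is None:
            unknown.append(idx)  # remember the OOV id, but keep decoding
            continue
        pieces.append(token)

    if unknown:
        warnings.warn(f"Ignored out-of-vocabulary ids: {unknown}", stacklevel=2)
    return "".join(pieces)
```

With this sketch, `decode_lenient(50258)` returns an empty string like the fast tokenizer does, but the caller also gets a warning instead of a silent result.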

WDYT @sgugger @LysandreJik @Narsil

Top GitHub Comments

sgugger commented, Dec 15, 2022

Yes, an exception should be raised and it’s more of a bug fix than a breaking change IMO. Users will be surprised, but they should be surprised when there is an out-of-vocab index.
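For comparison, the stricter behavior could look like the following sketch. Again this is not the transformers implementation: `TOY_VOCAB` and `decode_strict` are hypothetical names, and the choice of `ValueError` is an assumption made for illustration.

```python
# Hypothetical toy vocabulary standing in for a tokenizer's id -> token table.
TOY_VOCAB = {0: "Hello", 1: ",", 2: " world"}


def decode_strict(token_ids, vocab=TOY_VOCAB):
    """Decode ids to text, raising if any id is missing from the vocabulary."""
    # Accept a bare int as well as a sequence of ids.
    if isinstance(token_ids, int):
        token_ids = [token_ids]

    # Collect every offending id so the error message is actionable.
    unknown = [idx for idx in token_ids if idx not in vocab]
    if unknown:
        raise ValueError(
            f"Out-of-vocabulary ids {unknown}; vocabulary size is {len(vocab)}"
        )
    return "".join(vocab[idx] for idx in token_ids)
```

The point of raising explicitly is that the caller sees a message naming the bad ids, rather than an unrelated `TypeError` from deep inside the decoding loop.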

ArthurZucker commented, Dec 20, 2022

We can, but we would also have to help the OpenAI team with their tokenizer, which is based on our GPT2TokenizerFast 😅