Inconsistency between the tokenization of `CLIPTokenizer` and `CLIPTokenizerFast` with `openai/clip-vit-base-patch32`
See original GitHub issue

Environment info
- `transformers` version: 4.8.2
- Platform: Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.7.10
- PyTorch version (GPU?): 1.9.0+cu102 (False)
- Tensorflow version (GPU?): 2.5.0 (False)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: no
- Using distributed or parallel set-up in script?: no
Who can help
@patil-suraj, I think you worked on CLIP, maybe you could help me by confirming that this behavior is not normal. If it is and no one can deal with it first, I’d be happy to try to fix it.
Information
Model I am using (Bert, XLNet …): CLIP
To reproduce
The easiest way to reproduce is to open this Google Colab notebook.
Steps to reproduce the behavior:
- Import the slow and fast CLIP tokenizers from the `transformers` library and, optionally, the tokenizer from https://github.com/openai/CLIP
```python
from transformers import CLIPTokenizer, CLIPTokenizerFast

tokenizer_slow = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
tokenizer_fast = CLIPTokenizerFast.from_pretrained("openai/clip-vit-base-patch32")

from CLIP import clip as clip_orig
```
- Tokenize the same text with the 3 tokenizers
```python
text = "A photo of a cat"
context_length = 77

tokens_ids_orig = clip_orig.tokenize(text)
tokens_ids_slow = tokenizer_slow.encode(text, padding="max_length", max_length=context_length, return_tensors="pt")
tokens_ids_fast = tokenizer_fast.encode(text, padding="max_length", max_length=context_length, return_tensors="pt")
```
- Compare the outputs
```python
(tokens_ids_orig == tokens_ids_slow).sum() == context_length
```
Output: True
```python
(tokens_ids_orig == tokens_ids_fast).sum() == context_length
```
Output: False
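Not part of the original report, but a quick way to pinpoint where the two encodings diverge is to compare them position by position. `first_mismatch` below is a hypothetical helper (not part of `transformers`), and the token ids in the usage lines are illustrative values only, not real CLIP output:

```python
# Hypothetical helper (not part of transformers): return the first index at
# which two token-id sequences disagree, or None if they are identical.
def first_mismatch(ids_a, ids_b):
    for i, (a, b) in enumerate(zip(ids_a, ids_b)):
        if a != b:
            return i
    if len(ids_a) != len(ids_b):
        return min(len(ids_a), len(ids_b))
    return None

# Illustrative ids only:
print(first_mismatch([49406, 320, 1125], [49406, 320, 1125]))  # None
print(first_mismatch([49406, 320, 1125], [49406, 321, 1125]))  # 1
```

Running this on `tokens_ids_slow[0]` and `tokens_ids_fast[0]` shows which position (and hence which token) the fast tokenizer gets wrong.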
Expected behavior
I would have expected the slow and fast tokenizers to tokenize the text in the same way and produce the same token ids.
Issue Analytics
- State:
- Created 2 years ago
- Comments: 7 (5 by maintainers)
Top Results From Across the Web
Found some inconsistency on CLIPTokenizer, but how should ...
This behavior happens because CLIPTokenizer tries to fix text via BasicTokenizer when ftfy is not installed. BasicTokenizer strips accents, ...
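For context, the accent-stripping mentioned in the quote can be reproduced with the standard library. This is a sketch of the NFD-based normalization that `BasicTokenizer` applies when it strips accents, not the class itself:

```python
import unicodedata

def strip_accents(text):
    # Decompose to NFD so accented characters split into a base character
    # plus combining marks, then drop the combining marks (category "Mn").
    return "".join(
        ch for ch in unicodedata.normalize("NFD", text)
        if unicodedata.category(ch) != "Mn"
    )

print(strip_accents("héllò"))  # hello
```

This is why a text containing accents can tokenize differently depending on whether `ftfy` is installed.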
Top GitHub Comments
I'm really sorry for the delay. I have investigated a bit and I think that unfortunately the last problem is not limited to the fact that spaces are replaced by `Ġ`. For example, here is the output on another example:

I think that we also need a pre-tokenizer that reproduces the split induced in this line thanks to this regex:

```python
r"""<\|startoftext\|>|<\|endoftext\|>|'s|'t|'re|'ve|'m|'ll|'d|[\p{L}]+|[\p{N}]|[^\s\p{L}\p{N}]+"""
```

I think we could use `tokenizers.pre_tokenizers.Split` with `tokenizers.pre_tokenizers.Sequence`, but for the moment I couldn't make it work. At this point, the only solution I can propose that comes close (but doesn't match entirely) to the correct behavior is to replace the `tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)` line of the `CLIPConverter` class in `convert_slow_tokenizer.py` with:

This would give on the previous example:
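To make the intended split concrete, here is the same pattern applied with the third-party `regex` module, which supports the `\p{L}`/`\p{N}` Unicode classes that the stdlib `re` lacks (the original CLIP code uses `regex` for exactly this). This only illustrates the target pre-tokenization; it is not a working `tokenizers` configuration:

```python
import regex  # third-party module; supports \p{L}/\p{N}, unlike stdlib `re`

# The regex from clip's tokenizer: special tokens, common English
# contractions, runs of letters, single digits, and runs of other symbols.
pat = regex.compile(
    r"""<\|startoftext\|>|<\|endoftext\|>|'s|'t|'re|'ve|'m|'ll|'d|[\p{L}]+|[\p{N}]|[^\s\p{L}\p{N}]+"""
)

# The pre-tokenizer should produce exactly these pieces (whitespace dropped):
print(pat.findall("A photo of a cat's hat!"))
# ['A', 'photo', 'of', 'a', 'cat', "'s", 'hat', '!']
```

A faithful fast tokenizer would need its pre-tokenization step to yield the same pieces before byte-level encoding is applied.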
3 issues are causing this inconsistency:

1. The `ByteLevel` decoder, which was not removing the end-of-word suffix `</w>`. Using `BPEDecoder` fixes this.
2. The `bos` and `eos` tokens need to be added, but the current post-processor is the `ByteLevel` processor, which does not add them; using `TemplateProcessing` instead fixes this.
3. The slow tokenizer does not use `Ġ`; it instead replaces `</w>` with a space during decoding. But the `BPE` tokenizer in `tokenizers` always seems to replace space with `Ġ`, which is the only remaining issue.

Is there any way to disable this behavior @n1t0 @SaulLu?
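As a sketch of what the `BPEDecoder` fix in point 1 amounts to (assuming the standard `</w>` end-of-word suffix), decoding joins the sub-word pieces and turns each suffix into a space — this is a plain-Python illustration, not the `tokenizers` implementation:

```python
# Sketch of BPEDecoder-style decoding with suffix "</w>": concatenate the
# pieces, turn each end-of-word marker into a space, and trim the trailing one.
def bpe_decode(pieces, suffix="</w>"):
    return "".join(pieces).replace(suffix, " ").rstrip()

print(bpe_decode(["a</w>", "photo</w>", "of</w>", "a</w>", "cat</w>"]))
# a photo of a cat
```

A `ByteLevel` decoder leaves the `</w>` markers in place, which is why the decoded text came out wrong before the fix.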