Bug in CLIPTokenizer input handling
🐛 Bug
Describe the bug
The output of OpenAI’s CLIP tokenizer differs from the output of torchtext’s CLIPTokenizer when using the same inputs and settings.
To Reproduce
Steps to reproduce the behavior:
Install the CLIP tokenizer from OpenAI’s repo or copy the code from simple_tokenizer.py:
pip install git+https://github.com/openai/CLIP.git
Download the merge file from here: https://pytorch.s3.amazonaws.com/models/captum/clip_bpe_simple_vocab_48895.txt
I recreated the unicode setup from the bytes_to_unicode function:
from clip.simple_tokenizer import SimpleTokenizer
open_ai_tokenizer = SimpleTokenizer()
from torchtext.transforms import CLIPTokenizer as CLIPTokenizer_TorchText
# Setup test input
bpe_v = list(range(33, 127)) + list(range(161, 173)) + list(range(174, 256))
bpe_vocab = [chr(c) for c in bpe_v + [256 + n for n in list(range(0, 68))]]
bpe_vocab_str = " ".join(bpe_vocab)  # joining without the space makes the two outputs diverge even more
txt_output_open_ai = open_ai_tokenizer.encode(bpe_vocab_str)
print(txt_output_open_ai[-50:-25])
torchtext_module = CLIPTokenizer_TorchText(merges_path="clip_bpe_simple_vocab_48895.txt")
txt_output_torchtext = torchtext_module(bpe_vocab_str)
txt_output_torchtext = [int(i) for i in txt_output_torchtext]
print(txt_output_torchtext[-50:-25])
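For reference, the test string above is constructed to cover every character produced by OpenAI's bytes_to_unicode mapping. A rough sketch of that mapping, paraphrased from simple_tokenizer.py (variable names may differ slightly from the actual source), looks like this:
def bytes_to_unicode():
    # Bytes that are already printable keep their own code point; the
    # remaining bytes are shifted above 255 so that every byte maps to a
    # visible unicode character.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))
The three explicit ranges are 33-126, 161-172, and 174-255, plus 68 shifted code points starting at 256, which is exactly what bpe_v and bpe_vocab reconstruct.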
The repro code above outputs the following:
[128, 360, 128, 511, 128, 511, 128, 363, 128, 363, 328, 16384, 41901, 72, 329, 72, 329, 128, 369, 128, 369, 128, 371, 128, 371]
[128, 360, 128, 511, 128, 511, 128, 363, 128, 363, 328, 16384, 41901, 128, 367, 128, 367, 128, 369, 128, 369, 128, 371, 128, 371]
Specifically, 4 of these values differ (the excerpt below spans indices -38 to -33 of the full output):
[41901, 72, 329, 72, 329, 128]
[41901, 128, 367, 128, 367, 128]
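One way to narrow down the discrepancy is to look up the mismatched IDs in the OpenAI tokenizer's vocabulary. A hypothetical diagnostic, assuming SimpleTokenizer exposes a decoder dict (id -> BPE token, as in simple_tokenizer.py) and reusing open_ai_tokenizer from the repro above:
# Inspect which BPE tokens the differing IDs map to (assumes a `decoder`
# dict attribute; adjust if the attribute name differs in your version).
for token_id in (72, 329, 128, 367):
    print(token_id, repr(open_ai_tokenizer.decoder.get(token_id)))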
Expected behavior
The outputs should be the same.
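With the variables from the repro above, a minimal check of this expectation would be (it currently fails because of the four differing IDs):
# Both tokenizers should produce identical token IDs for the same input.
assert txt_output_open_ai == txt_output_torchtext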
Environment
PyTorch version: 1.11.0+cu113
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: 6.0.0-1ubuntu2 (tags/RELEASE_600/final)
CMake version: version 3.12.0
Libc version: glibc-2.26
Python version: 3.7.13 (default, Apr 24 2022, 01:04:09) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.4.188+-x86_64-with-Ubuntu-18.04-bionic
Is CUDA available: False
CUDA runtime version: 11.1.105
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.0.5
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] numpy==1.21.6
[pip3] torch==1.11.0+cu113
[pip3] torchaudio==0.11.0+cu113
[pip3] torchsummary==1.5.1
[pip3] torchtext==0.12.0
[pip3] torchvision==0.12.0+cu113
[conda] Could not collect
torchtext version is 0.12.0
Top GitHub Comments
@ProGamerGov thanks for surfacing this issue. We will ask @abhinavarora to take a look at this, since he is the most familiar with the BPE merge logic implementation.
@ebsmothers just wanted to ensure that you are aware of this bug in our CLIPTokenizer implementation, as I believe TorchMM is using this in your model training pipeline, correct?
@abhinavarora would you mind taking a look at this?