
Bug in CLIPTokenizer input handling


🐛 Bug

Describe the bug

The output of OpenAI’s CLIP tokenizer differs from torchtext’s CLIPTokenizer when given the same inputs and settings.

To Reproduce

Steps to reproduce the behavior:

Install the CLIP tokenizer from OpenAI’s repo or copy the code from simple_tokenizer.py:

pip install git+https://github.com/openai/CLIP.git

Download the merge file from here: https://pytorch.s3.amazonaws.com/models/captum/clip_bpe_simple_vocab_48895.txt
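
If you want to script the download, something along these lines works (a convenience sketch using only the standard library; the target filename matches the one used in the reproduction code below):

import urllib.request

merges_url = "https://pytorch.s3.amazonaws.com/models/captum/clip_bpe_simple_vocab_48895.txt"
urllib.request.urlretrieve(merges_url, "clip_bpe_simple_vocab_48895.txt")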

I recreated the unicode vocabulary set up by the bytes_to_unicode function and ran it through both tokenizers:

from clip.simple_tokenizer import SimpleTokenizer
from torchtext.transforms import CLIPTokenizer as CLIPTokenizer_TorchText

open_ai_tokenizer = SimpleTokenizer()

# Setup test input: the printable byte ranges used by bytes_to_unicode,
# plus the 68 remapped code points starting at 256
bpe_v = list(range(33, 127)) + list(range(161, 173)) + list(range(174, 256))
bpe_vocab = [chr(c) for c in bpe_v + [256 + n for n in range(68)]]
bpe_vocab_str = " ".join(bpe_vocab)  # removing the space separator makes the two outputs diverge even more

# Tokenize with OpenAI's reference tokenizer
txt_output_open_ai = open_ai_tokenizer.encode(bpe_vocab_str)
print(txt_output_open_ai[-50:-25])

# Tokenize with torchtext's CLIPTokenizer (it returns token ids as strings)
torchtext_module = CLIPTokenizer_TorchText(merges_path="clip_bpe_simple_vocab_48895.txt")
txt_output_torchtext = torchtext_module(bpe_vocab_str)
txt_output_torchtext = [int(i) for i in txt_output_torchtext]
print(txt_output_torchtext[-50:-25])
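
For reference, the byte ranges above mirror the ones built by bytes_to_unicode in simple_tokenizer.py. A rough paraphrase of that mapping (from memory; check the repo for the exact source) is:

def bytes_to_unicode_sketch():
    # Printable Latin-1 byte values map to themselves as unicode characters
    bs = list(range(ord("!"), ord("~") + 1)) + \
         list(range(ord("¡"), ord("¬") + 1)) + \
         list(range(ord("®"), ord("ÿ") + 1))
    cs = bs[:]
    n = 0
    # Every other byte value gets remapped to code points 256, 257, ...
    for b in range(2 ** 8):
        if b not in bs:
            bs.append(b)
            cs.append(2 ** 8 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))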

The above code outputs the following (OpenAI first, torchtext second):

[128, 360, 128, 511, 128, 511, 128, 363, 128, 363, 328, 16384, 41901, 72, 329, 72, 329, 128, 369, 128, 369, 128, 371, 128, 371]
[128, 360, 128, 511, 128, 511, 128, 363, 128, 363, 328, 16384, 41901, 128, 367, 128, 367, 128, 369, 128, 369, 128, 371, 128, 371]

Specifically, 4 of these values differ (indices -37 to -34 of the full output):

[41901, 72, 329, 72, 329, 128]
[41901, 128, 367, 128, 367, 128]
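
To narrow down which characters the two implementations disagree on, the diverging slice can be decoded back to text with the OpenAI tokenizer's decode method (a hypothetical follow-up step, not part of the original report; the slice indices correspond to the region shown above):

# Decode the diverging region with the reference tokenizer to see which
# characters the two BPE merge implementations handle differently
diff_slice = slice(-38, -32)
print(open_ai_tokenizer.decode(txt_output_open_ai[diff_slice]))
print(open_ai_tokenizer.decode(txt_output_torchtext[diff_slice]))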

Expected behavior

The outputs should be the same.

Environment

PyTorch version: 1.11.0+cu113
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: 6.0.0-1ubuntu2 (tags/RELEASE_600/final)
CMake version: version 3.12.0
Libc version: glibc-2.26

Python version: 3.7.13 (default, Apr 24 2022, 01:04:09)  [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.4.188+-x86_64-with-Ubuntu-18.04-bionic
Is CUDA available: False
CUDA runtime version: 11.1.105
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.0.5
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.21.6
[pip3] torch==1.11.0+cu113
[pip3] torchaudio==0.11.0+cu113
[pip3] torchsummary==1.5.1
[pip3] torchtext==0.12.0
[pip3] torchvision==0.12.0+cu113
[conda] Could not collect
torchtext version is 0.12.0


Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

1 reaction
Nayef211 commented, May 3, 2022

@ProGamerGov thanks for surfacing this issue. We will discuss with @abhinavarora to take a look at this since he is the most familiar with the BPE merge logic implementation.

@ebsmothers just wanted to ensure that you are aware of this bug in our CLIPTokenizer implementation, as I believe TorchMM is using this in your model training pipeline, correct?

0 reactions
Nayef211 commented, May 19, 2022

@abhinavarora would you mind taking a look at this?

