Bug in CLIPTokenizer input handling
🐛 Bug
Describe the bug
The output of OpenAI’s CLIP tokenizer differs from the output of torchtext’s CLIPTokenizer when using the same inputs and settings.
To Reproduce
Steps to reproduce the behavior:
Install the CLIP tokenizer from OpenAI’s repo or copy the code from simple_tokenizer.py:
pip install git+https://github.com/openai/CLIP.git
Download the merge file from here: https://pytorch.s3.amazonaws.com/models/captum/clip_bpe_simple_vocab_48895.txt
I recreated the unicode setup from the bytes_to_unicode function:
from clip.simple_tokenizer import SimpleTokenizer
open_ai_tokenizer = SimpleTokenizer()
from torchtext.transforms import CLIPTokenizer as CLIPTokenizer_TorchText
# Setup test input
bpe_v = list(range(33, 127)) + list(range(161, 173)) + list(range(174, 256))
bpe_vocab = [chr(c) for c in bpe_v + [256 + n for n in list(range(0, 68))]]
bpe_vocab_str = " ".join(bpe_vocab)  # joining without the space makes the two outputs diverge even more
txt_output_open_ai = open_ai_tokenizer.encode(bpe_vocab_str)
print(txt_output_open_ai[-50:-25])
torchtext_module = CLIPTokenizer_TorchText(merges_path="clip_bpe_simple_vocab_48895.txt")
txt_output_torchtext = torchtext_module(bpe_vocab_str)
txt_output_torchtext = [int(i) for i in txt_output_torchtext]
print(txt_output_torchtext[-50:-25])
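For reference, the test string above is constructed to cover every character produced by OpenAI's bytes_to_unicode mapping. A rough sketch of that mapping, paraphrased from simple_tokenizer.py (variable names may differ slightly from the actual source), looks like this:
def bytes_to_unicode():
    # Bytes that are already printable keep their own code point; the
    # remaining bytes are shifted above 255 so that every byte maps to a
    # visible unicode character.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))
The three explicit ranges are 33-126, 161-172, and 174-255, plus 68 shifted code points starting at 256, which is exactly what bpe_v and bpe_vocab reconstruct.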
The repro code above outputs the following:
[128, 360, 128, 511, 128, 511, 128, 363, 128, 363, 328, 16384, 41901, 72, 329, 72, 329, 128, 369, 128, 369, 128, 371, 128, 371]
[128, 360, 128, 511, 128, 511, 128, 363, 128, 363, 328, 16384, 41901, 128, 367, 128, 367, 128, 369, 128, 369, 128, 371, 128, 371]
Specifically, 4 of these values differ (the excerpt below spans indices -38 to -33 of the full output):
[41901, 72, 329, 72, 329, 128]
[41901, 128, 367, 128, 367, 128]
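One way to narrow down the discrepancy is to look up the mismatched IDs in the OpenAI tokenizer's vocabulary. A hypothetical diagnostic, assuming SimpleTokenizer exposes a decoder dict (id -> BPE token, as in simple_tokenizer.py) and reusing open_ai_tokenizer from the repro above:
# Inspect which BPE tokens the differing IDs map to (assumes a `decoder`
# dict attribute; adjust if the attribute name differs in your version).
for token_id in (72, 329, 128, 367):
    print(token_id, repr(open_ai_tokenizer.decoder.get(token_id)))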
Expected behavior
The outputs should be the same.
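With the variables from the repro above, a minimal check of this expectation would be (it currently fails because of the four differing IDs):
# Both tokenizers should produce identical token IDs for the same input.
assert txt_output_open_ai == txt_output_torchtext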
Environment
PyTorch version: 1.11.0+cu113
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: 6.0.0-1ubuntu2 (tags/RELEASE_600/final)
CMake version: version 3.12.0
Libc version: glibc-2.26
Python version: 3.7.13 (default, Apr 24 2022, 01:04:09) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.4.188+-x86_64-with-Ubuntu-18.04-bionic
Is CUDA available: False
CUDA runtime version: 11.1.105
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.0.5
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] numpy==1.21.6
[pip3] torch==1.11.0+cu113
[pip3] torchaudio==0.11.0+cu113
[pip3] torchsummary==1.5.1
[pip3] torchtext==0.12.0
[pip3] torchvision==0.12.0+cu113
[conda] Could not collect
torchtext version is 0.12.0
Top GitHub Comments
@ProGamerGov thanks for surfacing this issue. We will ask @abhinavarora to take a look at this, since he is the most familiar with the BPE merge logic implementation.
@ebsmothers just wanted to ensure that you are aware of this bug in our CLIPTokenizer implementation, as I believe TorchMM is using this in your model training pipeline, correct?
@abhinavarora would you mind taking a look at this?