
Inconsistency between the tokenization of `CLIPTokenizer` and `CLIPTokenizerFast` with `openai/clip-vit-base-patch32`


Environment info

  • transformers version: 4.8.2
  • Platform: Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.7.10
  • PyTorch version (GPU?): 1.9.0+cu102 (False)
  • Tensorflow version (GPU?): 2.5.0 (False)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help

@patil-suraj, I think you worked on CLIP; could you confirm that this behavior is not expected? If it isn’t and no one can deal with it first, I’d be happy to try to fix it.

Information

Model I am using (Bert, XLNet …): CLIP

To reproduce

The easiest way to reproduce this is to open this Google Colab.

Steps to reproduce the behavior:

  1. Import the slow and fast CLIP tokenizers from the transformers library and, optionally, the original tokenizer from https://github.com/openai/CLIP
from transformers import CLIPTokenizer, CLIPTokenizerFast
tokenizer_slow = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
tokenizer_fast = CLIPTokenizerFast.from_pretrained("openai/clip-vit-base-patch32")
from CLIP import clip as clip_orig
  2. Tokenize the same text with the three tokenizers
text = "A photo of a cat"
context_length = 77
tokens_ids_orig = clip_orig.tokenize(text)
tokens_ids_slow = tokenizer_slow.encode(text, padding="max_length", max_length=context_length, return_tensors='pt')
tokens_ids_fast = tokenizer_fast.encode(text, padding="max_length", max_length=context_length, return_tensors='pt')
  3. Compare the outputs
(tokens_ids_orig == tokens_ids_slow).sum() == context_length

Output: True

(tokens_ids_orig == tokens_ids_fast).sum() == context_length
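
If it helps to see where the two encodings diverge, here is a quick, purely illustrative check using the variables defined above:

# Compare the token strings produced by each tokenizer side by side
print(tokenizer_slow.convert_ids_to_tokens(tokens_ids_slow[0].tolist()))
print(tokenizer_fast.convert_ids_to_tokens(tokens_ids_fast[0].tolist()))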

Expected behavior

I would have expected the slow and fast versions to tokenize the text in the same way.

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 7 (5 by maintainers)

Top GitHub Comments

1 reaction
SaulLu commented, Sep 28, 2021

I’m really sorry for the delay. I have investigated a bit and I think that, unfortunately, the remaining problem is not limited to spaces being replaced by Ġ.

Here is the output on another example:

tokenizer_slow = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
tokenizer_fast = CLIPTokenizerFast.from_pretrained("openai/clip-vit-base-patch32", from_slow=True)

text = "A\n'll 11p223RF☆ho!!to? of a cat"
tokenizer_slow.tokenize(text)
# ['a</w>', "'ll</w>", '1</w>', '1</w>', 'p</w>', '2</w>', '2</w>', '3</w>', 'rf</w>', 'âĺĨ</w>', 'ho</w>', '!!</w>', 'to</w>', '?</w>', 'of</w>', 'a</w>', 'cat</w>']

tokenizer_fast.tokenize(text)
# ['a</w>', 'Ġ', "'</w>", 'll</w>', 'Ġ', '1', '1</w>', 'p</w>', '2', '2', '3</w>', 'rf</w>', 'âĺĨ</w>', 'ho</w>', '!!</w>', 'to</w>', '?</w>', 'Ġ', 'of</w>', 'Ġ', 'a</w>', 'Ġ', 'cat</w>']

I think we also need a pre-tokenizer that reproduces the split induced in this line by this regex: r"""<\|startoftext\|>|<\|endoftext\|>|'s|'t|'re|'ve|'m|'ll|'d|[\p{L}]+|[\p{N}]|[^\s\p{L}\p{N}]+""". I think we could use tokenizers.pre_tokenizers.Split with tokenizers.pre_tokenizers.Sequence, but for the moment I couldn’t make it work (one possible configuration is sketched below).
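
For what it’s worth, here is an untested sketch of how such a pre-tokenizer could be wired up with tokenizers.Regex and pre_tokenizers.Split; the behavior="removed" / invert=True combination is my assumption, not something confirmed in this thread:

from tokenizers import Regex, pre_tokenizers

# The word-splitting part of the original CLIP pattern (the special tokens are
# normally handled separately, so they are left out here)
clip_pattern = Regex(r"""'s|'t|'re|'ve|'m|'ll|'d|[\p{L}]+|[\p{N}]|[^\s\p{L}\p{N}]+""")

pre_tokenizer = pre_tokenizers.Sequence(
    [
        # Keep every regex match as its own piece and drop the whitespace between them
        pre_tokenizers.Split(pattern=clip_pattern, behavior="removed", invert=True),
        pre_tokenizers.ByteLevel(add_prefix_space=False),
    ]
)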

At this point, the only solution I can propose that comes close to the correct behavior (but doesn’t match it entirely) is to replace the tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False) line of the CLIPConverter class in convert_slow_tokenizer.py with:

        tokenizer.pre_tokenizer = pre_tokenizers.Sequence(
            [
                pre_tokenizers.WhitespaceSplit(),
                pre_tokenizers.ByteLevel(
                    add_prefix_space=False,
                ),
            ]
        )

On the previous example, this would give:

tokenizer_slow = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
tokenizer_fast = CLIPTokenizerFast.from_pretrained("openai/clip-vit-base-patch32", from_slow=True)

text = "A\n'll 11p223RF☆ho!!to? of a cat"
tokenizer_slow.tokenize(text)
# ['a</w>', "'ll</w>", '1</w>', '1</w>', 'p</w>', '2</w>', '2</w>', '3</w>', 'rf</w>', 'âĺĨ</w>', 'ho</w>', '!!</w>', 'to</w>', '?</w>', 'of</w>', 'a</w>', 'cat</w>']

tokenizer_fast.tokenize(text)
# ['a</w>', "'ll</w>", '1', '1</w>', 'p</w>', '2', '2', '3</w>', 'rf</w>', 'âĺĨ</w>', 'ho</w>', '!!</w>', 'to</w>', '?</w>', 'of</w>', 'a</w>', 'cat</w>']
1 reaction
patil-suraj commented, Sep 9, 2021

There are 3 issues causing this inconsistency (the first two fixes are sketched at the end of this comment):

  • The fast tokenizer was using the ByteLevel decoder, which does not remove the end-of-word suffix </w>. Using BPEDecoder fixes this.
  • CLIP uses bos and eos tokens, but the current post-processor is the ByteLevel processor, which does not add them; using TemplateProcessing instead fixes this.
  • Unlike GPT2’s BPE tokenizer, CLIP’s BPE does not represent a space with Ġ. It instead replaces </w> with a space during decoding. But the BPE tokenizer in tokenizers always seems to replace space with Ġ, which is the only remaining issue.
tokenizer_slow = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
tokenizer_fast = CLIPTokenizerFast.from_pretrained("openai/clip-vit-base-patch32", from_slow=True)

text = "A photo of a cat"
tokenizer_slow.tokenize(text)
# ['a</w>', 'photo</w>', 'of</w>', 'a</w>', 'cat</w>']

tokenizer_fast.tokenize(text)
# ['a</w>', 'Ġ', 'photo</w>', 'Ġ', 'of</w>', 'Ġ', 'a</w>', 'Ġ', 'cat</w>']

Is there any way to disable this behavior, @n1t0 @SaulLu?
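
For reference, a minimal sketch of the first two fixes described above (BPEDecoder for the </w> suffix, TemplateProcessing for the bos/eos tokens). It assumes tokenizer is the backend tokenizers.Tokenizer being built in the converter; the exact wiring is an illustration, not the final patch:

from tokenizers import decoders, processors

# Strip the end-of-word suffix when decoding, instead of the ByteLevel decoder
tokenizer.decoder = decoders.BPEDecoder(suffix="</w>")

# Add CLIP's bos/eos tokens around every encoded sequence
bos, eos = "<|startoftext|>", "<|endoftext|>"
tokenizer.post_processor = processors.TemplateProcessing(
    single=f"{bos} $A {eos}",
    pair=f"{bos} $A {eos} {bos} $B {eos}",
    special_tokens=[(bos, tokenizer.token_to_id(bos)), (eos, tokenizer.token_to_id(eos))],
)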
