
Inconsistency between the tokenization of `CLIPTokenizer` and `CLIPTokenizerFast` with `openai/clip-vit-base-patch32`


Environment info

  • transformers version: 4.8.2
  • Platform: Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.7.10
  • PyTorch version (GPU?): 1.9.0+cu102 (False)
  • Tensorflow version (GPU?): 2.5.0 (False)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help

@patil-suraj, I think you worked on CLIP; could you confirm that this behavior is not expected? If it isn’t and no one can deal with it first, I’d be happy to try to fix it.

Information

Model I am using (Bert, XLNet …): CLIP

To reproduce

The easiest way to reproduce this is to open this Google Colab.

Steps to reproduce the behavior:

  1. Import the slow and fast CLIP tokenizers from the transformers library and, optionally, the original tokenizer from https://github.com/openai/CLIP
from transformers import CLIPTokenizer, CLIPTokenizerFast
tokenizer_slow = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
tokenizer_fast = CLIPTokenizerFast.from_pretrained("openai/clip-vit-base-patch32")
from CLIP import clip as clip_orig
  2. Tokenize the same text with the three tokenizers
text = "A photo of a cat"
context_length = 77
tokens_ids_orig = clip_orig.tokenize(text)
tokens_ids_slow = tokenizer_slow.encode(text, padding="max_length", max_length=context_length, return_tensors='pt')
tokens_ids_fast = tokenizer_fast.encode(text, padding="max_length", max_length=context_length, return_tensors='pt')
  3. Compare the outputs
(tokens_ids_orig == tokens_ids_slow).sum() == context_length

Output: True

(tokens_ids_orig == tokens_ids_fast).sum() == context_length
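
If it helps to see where the two encodings diverge, here is a quick, purely illustrative check using the variables defined above:

# Compare the token strings produced by each tokenizer side by side
print(tokenizer_slow.convert_ids_to_tokens(tokens_ids_slow[0].tolist()))
print(tokenizer_fast.convert_ids_to_tokens(tokens_ids_fast[0].tolist()))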

Expected behavior

I would have expected the slow and fast versions to tokenize the text in the same way.

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 7 (5 by maintainers)

Top GitHub Comments

1 reaction
SaulLu commented, Sep 28, 2021

I’m really sorry for the delay. I have investigated a bit and I think that, unfortunately, the remaining problem is not limited to spaces being replaced by Ġ.

Here is the output on another example:

tokenizer_slow = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
tokenizer_fast = CLIPTokenizerFast.from_pretrained("openai/clip-vit-base-patch32", from_slow=True)

text = "A\n'll 11p223RF☆ho!!to? of a cat"
tokenizer_slow.tokenize(text)
# ['a</w>', "'ll</w>", '1</w>', '1</w>', 'p</w>', '2</w>', '2</w>', '3</w>', 'rf</w>', 'âĺĨ</w>', 'ho</w>', '!!</w>', 'to</w>', '?</w>', 'of</w>', 'a</w>', 'cat</w>']

tokenizer_fast.tokenize(text)
# ['a</w>', 'Ġ', "'</w>", 'll</w>', 'Ġ', '1', '1</w>', 'p</w>', '2', '2', '3</w>', 'rf</w>', 'âĺĨ</w>', 'ho</w>', '!!</w>', 'to</w>', '?</w>', 'Ġ', 'of</w>', 'Ġ', 'a</w>', 'Ġ', 'cat</w>']

I think we also need a pre-tokenizer that reproduces the split induced in this line by this regex: r"""<\|startoftext\|>|<\|endoftext\|>|'s|'t|'re|'ve|'m|'ll|'d|[\p{L}]+|[\p{N}]|[^\s\p{L}\p{N}]+""". I think we could use tokenizers.pre_tokenizers.Split with tokenizers.pre_tokenizers.Sequence, but for the moment I couldn’t make it work (one possible configuration is sketched below).
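
For what it’s worth, here is an untested sketch of how such a pre-tokenizer could be wired up with tokenizers.Regex and pre_tokenizers.Split; the behavior="removed" / invert=True combination is my assumption, not something confirmed in this thread:

from tokenizers import Regex, pre_tokenizers

# The word-splitting part of the original CLIP pattern (the special tokens are
# normally handled separately, so they are left out here)
clip_pattern = Regex(r"""'s|'t|'re|'ve|'m|'ll|'d|[\p{L}]+|[\p{N}]|[^\s\p{L}\p{N}]+""")

pre_tokenizer = pre_tokenizers.Sequence(
    [
        # Keep every regex match as its own piece and drop the whitespace between them
        pre_tokenizers.Split(pattern=clip_pattern, behavior="removed", invert=True),
        pre_tokenizers.ByteLevel(add_prefix_space=False),
    ]
)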

At this point, the only solution I can propose that comes close to the correct behavior (but doesn’t match it entirely) is to replace the tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False) line of the CLIPConverter class in convert_slow_tokenizer.py with:

        tokenizer.pre_tokenizer = pre_tokenizers.Sequence(
            [
                pre_tokenizers.WhitespaceSplit(),
                pre_tokenizers.ByteLevel(
                    add_prefix_space=False,
                ),
            ]
        )

On the previous example, this would give:

tokenizer_slow = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
tokenizer_fast = CLIPTokenizerFast.from_pretrained("openai/clip-vit-base-patch32", from_slow=True)

text = "A\n'll 11p223RF☆ho!!to? of a cat"
tokenizer_slow.tokenize(text)
# ['a</w>', "'ll</w>", '1</w>', '1</w>', 'p</w>', '2</w>', '2</w>', '3</w>', 'rf</w>', 'âĺĨ</w>', 'ho</w>', '!!</w>', 'to</w>', '?</w>', 'of</w>', 'a</w>', 'cat</w>']

tokenizer_fast.tokenize(text)
# ['a</w>', "'ll</w>", '1', '1</w>', 'p</w>', '2', '2', '3</w>', 'rf</w>', 'âĺĨ</w>', 'ho</w>', '!!</w>', 'to</w>', '?</w>', 'of</w>', 'a</w>', 'cat</w>']
1 reaction
patil-suraj commented, Sep 9, 2021

There are 3 issues causing this inconsistency (the first two fixes are sketched at the end of this comment):

  • The fast tokenizer was using the ByteLevel decoder, which does not remove the end-of-word suffix </w>. Using BPEDecoder fixes this.
  • CLIP uses bos and eos tokens, but the current post-processor is the ByteLevel processor, which does not add them; using TemplateProcessing instead fixes this.
  • Unlike GPT2’s BPE tokenizer, CLIP’s BPE does not represent a space with Ġ. It instead replaces </w> with a space during decoding. But the BPE tokenizer in tokenizers always seems to replace space with Ġ, which is the only remaining issue.
tokenizer_slow = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
tokenizer_fast = CLIPTokenizerFast.from_pretrained("openai/clip-vit-base-patch32", from_slow=True)

text = "A photo of a cat"
tokenizer_slow.tokenize(text)
# ['a</w>', 'photo</w>', 'of</w>', 'a</w>', 'cat</w>']

tokenizer_fast.tokenize(text)
# ['a</w>', 'Ġ', 'photo</w>', 'Ġ', 'of</w>', 'Ġ', 'a</w>', 'Ġ', 'cat</w>']

Is there any way to disable this behavior, @n1t0 @SaulLu?
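
For reference, a minimal sketch of the first two fixes described above (BPEDecoder for the </w> suffix, TemplateProcessing for the bos/eos tokens). It assumes tokenizer is the backend tokenizers.Tokenizer being built in the converter; the exact wiring is an illustration, not the final patch:

from tokenizers import decoders, processors

# Strip the end-of-word suffix when decoding, instead of the ByteLevel decoder
tokenizer.decoder = decoders.BPEDecoder(suffix="</w>")

# Add CLIP's bos/eos tokens around every encoded sequence
bos, eos = "<|startoftext|>", "<|endoftext|>"
tokenizer.post_processor = processors.TemplateProcessing(
    single=f"{bos} $A {eos}",
    pair=f"{bos} $A {eos} {bos} $B {eos}",
    special_tokens=[(bos, tokenizer.token_to_id(bos)), (eos, tokenizer.token_to_id(eos))],
)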
