
CLIP tokenizer inconsistent with OpenAI release

See original GitHub issue

Environment info

  • transformers version: 4.6.1
  • Platform: Linux-5.4.0-52-generic-x86_64-with-glibc2.10
  • Python version: 3.8.5
  • PyTorch version (GPU?): 1.8.1 (True)
  • Tensorflow version (GPU?): 2.4.1 (False)
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

@patil-suraj

Information

Model I am using (Bert, XLNet …): CLIP

The problem arises when using:

  • my own modified scripts: (give details below)
  • the official example scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

>>> import clip
>>> import transformers
>>> clip.tokenize('hello world')
tensor([[49406,  3306,  1002, 49407,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0]])
>>> tokenizer = transformers.CLIPTokenizerFast.from_pretrained('openai/clip-vit-base-patch32')
>>> tokenizer('hello world')
{'input_ids': [3306, 220, 1002], 'attention_mask': [1, 1, 1]}

The HF CLIPTokenizerFast seems to add an extra token (220) while dropping the <bos> and <eos> tokens. Am I missing something here?

Thanks!
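For reference, here is a minimal side-by-side check (a sketch, not part of the original report; it assumes the OpenAI clip package is installed and uses the same openai/clip-vit-base-patch32 checkpoint as the snippet above). It strips clip.tokenize's zero padding and prints the raw input_ids from both the slow and fast Hugging Face tokenizers, so a missing <bos>/<eos> or any extra token is visible at a glance:

import clip
from transformers import CLIPTokenizer, CLIPTokenizerFast

text = "hello world"

# OpenAI reference: a fixed-length (77) tensor padded with zeros; strip the
# trailing zero padding so only the content tokens remain.
reference = clip.tokenize(text)[0].tolist()
while reference and reference[-1] == 0:
    reference.pop()
print("openai/clip        :", reference)  # [49406, 3306, 1002, 49407] per the output above

# Hugging Face slow and fast tokenizers for the same checkpoint.
for cls in (CLIPTokenizer, CLIPTokenizerFast):
    tok = cls.from_pretrained("openai/clip-vit-base-patch32")
    print(f"{cls.__name__:19}:", tok(text)["input_ids"])

On the transformers version listed in the environment above, the fast tokenizer prints [3306, 220, 1002], which is the discrepancy this issue reports.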

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 5

Top GitHub Comments

1 reaction
nickgkan commented, Jul 14, 2021

Hi, is there any update/eta on this?

0 reactions
github-actions[bot] commented, Aug 8, 2021

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

