CLIP tokenizer inconsistent with OpenAI release
Environment info
- transformers version: 4.6.1
- Platform: Linux-5.4.0-52-generic-x86_64-with-glibc2.10
- Python version: 3.8.5
- PyTorch version (GPU?): 1.8.1 (True)
- Tensorflow version (GPU?): 2.4.1 (False)
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
Information
Model I am using (Bert, XLNet …): CLIP (openai/clip-vit-base-patch32)
The problem arises when using:
- my own modified scripts: (give details below)
- the official example scripts: (give details below)
The task I am working on is:
- an official GLUE/SQuAD task: (give the name)
- my own task or dataset: (give details below)
To reproduce
Steps to reproduce the behavior:
>>> import clip
>>> import transformers
>>> clip.tokenize('hello world')
tensor([[49406,  3306,  1002, 49407,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0]])
>>> tokenizer = transformers.CLIPTokenizerFast.from_pretrained('openai/clip-vit-base-patch32')
>>> tokenizer('hello world')
{'input_ids': [3306, 220, 1002], 'attention_mask': [1, 1, 1]}
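For context, here is a minimal parity-check sketch (not part of the original report) that compares the two tokenizers on the same strings, assuming both the `clip` and `transformers` packages are installed:

```python
# Sketch: compare OpenAI's clip.tokenize against the HF fast tokenizer.
# Strips the trailing zero padding from the CLIP side before comparing
# (fine for these strings, since id 0 is a real token for "!" in CLIP's
# vocab and neither example contains one).
import clip
from transformers import CLIPTokenizerFast

hf_tokenizer = CLIPTokenizerFast.from_pretrained('openai/clip-vit-base-patch32')

for text in ['hello world', 'a photo of a cat']:
    openai_ids = [i for i in clip.tokenize(text)[0].tolist() if i != 0]
    hf_ids = hf_tokenizer(text)['input_ids']
    print(text, openai_ids, hf_ids, 'match' if openai_ids == hf_ids else 'MISMATCH')
```

On transformers 4.6.1 this prints MISMATCH for both strings, which is the behavior reported above.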
The HF CLIPTokenizerFast seems to insert an extra token (id 220) between "hello" and "world" while dropping the <|startoftext|> (49406) and <|endoftext|> (49407) tokens that the OpenAI tokenizer adds. Am I missing something here?
Thanks!
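One way to dig into the mismatch (a diagnostic sketch, not from the original report) is to decode each id from both outputs back to its token string, which shows where the extra id 220 and the missing special tokens come from:

```python
# Diagnostic sketch (assumes the same environment as the report): map the ids
# produced by each tokenizer back to their token strings.
from transformers import CLIPTokenizerFast

tok = CLIPTokenizerFast.from_pretrained('openai/clip-vit-base-patch32')
for label, ids in [('OpenAI clip.tokenize', [49406, 3306, 1002, 49407]),
                   ('HF CLIPTokenizerFast', [3306, 220, 1002])]:
    print(label, tok.convert_ids_to_tokens(ids))
```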
Comments
Hi, is there any update/eta on this?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.