Make the `CLIPTokenizer`'s `encoder_json_path` variable optional, and use `dict(zip(vocab, range(len(vocab))))` instead


🚀 Feature

https://github.com/pytorch/text/blob/main/torchtext/transforms.py#L312

In both CLIP and OpenCLIP, the encoder is just the vocab run through `dict(zip(vocab, range(len(vocab))))`, so it doesn’t make much sense to require an encoder.json file for this information. The encoder.json requirement should be optional: the vocab file itself can be used to create the encoder, which makes the encoder file redundant.

https://github.com/openai/CLIP/blob/main/clip/simple_tokenizer.py#L74 https://github.com/mlfoundations/open_clip/blob/main/src/clip/tokenizer.py#L78

The current clip_encoder.json test asset is just the Python dict created by `dict(zip(vocab, range(len(vocab))))`, so while it makes a useful test for specifying an encoder, it’s redundant: https://github.com/pytorch/text/blob/main/test/asset/clip_encoder.json
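
As a rough sketch of the requested behaviour (not torchtext's actual implementation), the optional fallback could look something like the following. The helper name `build_encoder` and the toy vocabulary are hypothetical and only for illustration; the sketch assumes the vocab is an ordered list of unique token strings.

```python
import json
from typing import Dict, List, Optional


def build_encoder(vocab: List[str], encoder_json_path: Optional[str] = None) -> Dict[str, int]:
    """Hypothetical helper (not part of torchtext): return a token -> index map.

    If an encoder JSON file is given, load it; otherwise derive the
    encoder from the vocab order, as CLIP and OpenCLIP do.
    """
    if encoder_json_path is not None:
        with open(encoder_json_path, "r", encoding="utf-8") as f:
            return json.load(f)
    # Each token's index is simply its position in the vocab list.
    return dict(zip(vocab, range(len(vocab))))


# Toy vocabulary for illustration only; a real CLIP vocab is far larger.
toy_vocab = ["a", "b", "a</w>", "b</w>", "ab</w>", "<|startoftext|>", "<|endoftext|>"]

encoder = build_encoder(toy_vocab)
# The decoder is the inverse map, index -> token.
decoder = {index: token for token, index in encoder.items()}
```

Because the indices come straight from list order, the vocab needs to be built in exactly the same order every time; if it contained duplicate tokens, the later index would silently overwrite the earlier one.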

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

parmeet commented, Feb 15, 2022 (2 reactions)
abhinavarora commented, Feb 18, 2022 (1 reaction)

@abhinavarora Yeah, that makes sense and sounds good to me!

@ProGamerGov Here is PR: #1622
