Make the `CLIPTokenizer`'s `encoder_json_path` variable optional, and use `dict(zip(vocab, range(len(vocab))))` instead
🚀 Feature
https://github.com/pytorch/text/blob/main/torchtext/transforms.py#L312
In both CLIP and OpenCLIP, the encoder is simply the vocab run through `dict(zip(vocab, range(len(vocab))))`, so it doesn't make much sense to require an `encoder.json` file for this information. The `encoder.json` requirement should be optional, as the vocab file itself can be used to create the encoder, making the encoder file redundant.
https://github.com/openai/CLIP/blob/main/clip/simple_tokenizer.py#L74 https://github.com/mlfoundations/open_clip/blob/main/src/clip/tokenizer.py#L78
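For illustration, here is a minimal sketch of how the tokenizer could fall back to deriving the encoder when no `encoder.json` is supplied. The function and parameter names below are placeholders for this issue, not the actual torchtext API:

```python
import json
from typing import Dict, List, Optional


def load_encoder(vocab: List[str], encoder_json_path: Optional[str] = None) -> Dict[str, int]:
    """Return the token -> id mapping for the tokenizer.

    If an encoder.json path is given, load it as before; otherwise derive the
    mapping from the vocab, which is all CLIP/OpenCLIP do anyway.
    """
    if encoder_json_path is not None:
        with open(encoder_json_path, "r", encoding="utf-8") as f:
            return json.load(f)
    # The encoder is just the vocab enumerated in order.
    return dict(zip(vocab, range(len(vocab))))
```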
The current `clip_encoder.json` test asset is just the Python dict created by `dict(zip(vocab, range(len(vocab))))`, so while it makes a useful test case for specifying an encoder, it's redundant: https://github.com/pytorch/text/blob/main/test/asset/clip_encoder.json
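As a rough illustration of that redundancy, a hypothetical check could rebuild the mapping from the vocab and compare it to the JSON asset (the function name and arguments here are assumptions, not an existing test):

```python
import json
from typing import List


def encoder_asset_is_redundant(vocab: List[str], encoder_json_path: str) -> bool:
    """Return True if the on-disk encoder.json equals the mapping derived from
    the vocab; if it does, the asset carries no information beyond the vocab."""
    with open(encoder_json_path, "r", encoding="utf-8") as f:
        encoder_from_file = json.load(f)
    return encoder_from_file == dict(zip(vocab, range(len(vocab))))
```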
Top GitHub Comments
cc: @abhinavarora
@ProGamerGov Here is PR: #1622