`CLIPTokenizer` should output tensors from its `forward` function rather than lists of numbers in string form
🚀 Feature
Outputs for the current CLIP tokenizer appear to be a list of strings of numbers, rather than a tensor or even a list of numbers:
import torchtext

clip_tokenizer = torchtext.transforms.CLIPTokenizer(merges_path="clip_bpe.txt")
test_str = "This is a test"
test_output = clip_tokenizer(test_str)
print(test_output) # ['589', '533', '320', '1628']
It might be easier to have outputs be tensors with a shape of [batch, tokens].
OpenAI’s CLIP tokenizer, for example, returns outputs like the following, with zeros filling up the rest of the model’s context_length (which might also be a good idea to include as a variable in torchtext’s tokenizer):
import clip

test_str = "This is a test"
test_output_clip = clip.tokenize(test_str)
print(test_output_clip)
# shape: [1, 77]
tensor([[49406, 589, 533, 320, 1628, 49407, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0]])
Issue Analytics
- Created: 2 years ago
- Comments: 6 (6 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Thanks @ProGamerGov for reporting this. We are looking to standardize the interface of our transforms/tokenizers and this will be an action item for us to refactor this if needed. The general philosophy is that any tokenizer has an interface like this:
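A sketch of that interface, consistent with torchtext's tokenizer transforms (the class name and exact signature below are assumptions for illustration, not quoted from the comment):

from typing import List, Union
import torch

class ExampleTokenizer(torch.nn.Module):
    # A single string maps to a list of tokens; a batch of strings maps to a
    # list of token lists. Concrete tokenizers fill in the body.
    def forward(self, input: Union[str, List[str]]) -> Union[List[str], List[List[str]]]:
        ...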
Now the resulting tokens produced by the tokenizer could be anything: real tokens or subword ids, depending on the underlying tokenizer, and users are free to interpret them however they like for downstream models. Let me know if this makes sense. Please feel free to share any ideas you have 😃
You can also try to use Sequential to compose the transform, as shown here for XLM-R text pre-processing. In the linked example, you would need to add a ToTensor transform so that you get a tensor instead of List[List[str]]. For VocabTransform (converting the string indices to the corresponding integer ids), you can construct the corresponding vocab object and pass it to the transform.
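A minimal sketch of one such composition, assuming the merges file from the example above; it uses StrToIntTransform in place of VocabTransform, since CLIPTokenizer already emits token ids as strings, and zero padding to mirror OpenAI's output:

import torchtext.transforms as T

# Compose tokenization, string-to-int conversion, and tensor conversion with padding.
text_transform = T.Sequential(
    T.CLIPTokenizer(merges_path="clip_bpe.txt"),  # token ids as strings, e.g. ['589', '533', ...]
    T.StrToIntTransform(),                        # string ids -> integer ids
    T.ToTensor(padding_value=0),                  # pad the batch and return a torch.Tensor
)

ids = text_transform(["This is a test", "Another longer example"])
print(ids.shape)  # [batch, max_tokens]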