
`CLIPTokenizer` should output tensors in its `forward` function rather than lists of numbers in str form


🚀 Feature

Outputs from the current CLIP tokenizer appear to be a list of numbers in string form, rather than a tensor or even a list of integers:

import torchtext.transforms

clip_tokenizer = torchtext.transforms.CLIPTokenizer(merges_path="clip_bpe.txt")

test_str = "This is a test"
test_output = clip_tokenizer(test_str)
print(test_output) # ['589', '533', '320', '1628']

It might be easier to have outputs be tensors with a shape of [batch, tokens].

OpenAI’s CLIP tokenizer, for example, returns outputs like the following, with zeros padding out the rest of the model’s context_length (which might also be worth exposing as a parameter in torchtext’s tokenizer):

import clip  # OpenAI CLIP package, from https://github.com/openai/CLIP

test_str = "This is a test"
test_output_clip = clip.tokenize(test_str)
print(test_output_clip)

# shape: [1, 77]
tensor([[49406,   589,   533,   320,  1628, 49407,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0]])

https://github.com/openai/CLIP/blob/main/clip/clip.py#L195
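
Until the transform changes, one possible workaround is to do the padding in user code. The sketch below is a hypothetical helper (the 77-token context length and the 49406/49407 BOS/EOS ids are taken from the OpenAI output above, and the merges file path from the first snippet); it returns a [batch, context_length] tensor similar to clip.tokenize:

import torch
import torchtext.transforms

clip_tokenizer = torchtext.transforms.CLIPTokenizer(merges_path="clip_bpe.txt")

def tokenize_to_tensor(texts, context_length=77, bos_id=49406, eos_id=49407):
    # Cast the tokenizer's string ids to ints, add BOS/EOS, and zero-pad each
    # row out to context_length, mirroring what OpenAI's clip.tokenize returns.
    if isinstance(texts, str):
        texts = [texts]
    result = torch.zeros(len(texts), context_length, dtype=torch.long)
    for i, text in enumerate(texts):
        ids = [bos_id] + [int(t) for t in clip_tokenizer(text)] + [eos_id]
        result[i, :len(ids)] = torch.tensor(ids[:context_length])
    return result

print(tokenize_to_tensor("This is a test").shape)  # torch.Size([1, 77])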

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

1 reaction
abhinavarora commented, Mar 1, 2022

Thanks @ProGamerGov for reporting this. We are looking to standardize the interface of our transforms/tokenizers, and refactoring this will be an action item for us if needed. The general philosophy is that any tokenizer has an interface like this:

from typing import List

def tokenize(sentence: str) -> List[str]:
    pass

Now, the resulting tokens produced by the tokenizer could be anything: they could be real tokens or subword ids, depending on the underlying tokenizer, and users are free to interpret them however they like for downstream models. Let me know if this makes sense. Please feel free to share any ideas you have 😃
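
As a rough illustration of that point (assuming local asset paths; both classes live in torchtext.transforms), two tokenizers can share the same str -> List[str] interface while producing different kinds of tokens:

import torchtext.transforms as T

# SentencePieceTokenizer yields real subword tokens, while CLIPTokenizer
# yields subword ids rendered as strings; both satisfy the same interface.
sp_tokenizer = T.SentencePieceTokenizer("spm.model")          # assumed local SentencePiece model
clip_tokenizer = T.CLIPTokenizer(merges_path="clip_bpe.txt")  # merges file from the issue

print(sp_tokenizer("This is a test"))    # e.g. ['▁This', '▁is', '▁a', '▁test']
print(clip_tokenizer("This is a test"))  # ['589', '533', '320', '1628']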

1 reaction
parmeet commented, Feb 28, 2022

You can also try using Sequential to compose the transforms, as shown here for XLM-R text pre-processing. In the linked example, you would need to add a ToTensor transform so that you get a tensor instead of List[List[str]]. For VocabTransform (converting string indices to the corresponding integer ids), you can construct the corresponding vocab object and pass it to the transform.
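
A rough sketch of that composition (assuming torchtext 0.12+, where Sequential, StrToIntTransform, and ToTensor are available in torchtext.transforms; here StrToIntTransform stands in for the VocabTransform step mentioned above, since CLIPTokenizer already emits integer ids as strings):

import torchtext.transforms as T

# Tokenize -> cast the string ids to ints -> pad and stack into a LongTensor,
# so the composed pipeline returns a tensor instead of List[List[str]].
text_transform = T.Sequential(
    T.CLIPTokenizer(merges_path="clip_bpe.txt"),
    T.StrToIntTransform(),
    T.ToTensor(padding_value=0),
)

batch = ["This is a test", "Another, slightly longer test"]
print(text_transform(batch))  # shape: [2, longest_sequence_in_batch]

Unlike OpenAI's tokenizer, this pads only to the longest sequence in the batch and adds no BOS/EOS ids; handling those inside the transform is part of what the feature request asks for.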
