`CLIPTokenizer` should output tensors from its `forward` function rather than lists of numbers in string form
🚀 Feature
Outputs for the current CLIP tokenizer appear to be a list of strings of numbers, rather than a tensor or even a list of numbers:
import torchtext

clip_tokenizer = torchtext.transforms.CLIPTokenizer(merges_path="clip_bpe.txt")
test_str = "This is a test"
test_output = clip_tokenizer(test_str)
print(test_output) # ['589', '533', '320', '1628']
It might be easier to have outputs be tensors with a shape of [batch, tokens].
OpenAI’s CLIP tokenizer, for example, returns outputs like the following, with zeros filling up the rest of the model’s context_length (which might also be a good idea to include as a variable in torchtext’s tokenizer):
import clip

test_str = "This is a test"
test_output_clip = clip.tokenize(test_str)
print(test_output_clip)
# shape: [1, 77]
tensor([[49406, 589, 533, 320, 1628, 49407, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0]])
Issue Analytics
- Created: 2 years ago
- Comments: 6 (6 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Thanks @ProGamerGov for reporting this. We are looking to standardize the interface of our transforms/tokenizers and this will be an action item for us to refactor this if needed. The general philosophy is that any tokenizer has an interface like this:
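A sketch of that interface, consistent with torchtext's tokenizer transforms (the class name and exact signature below are assumptions for illustration, not quoted from the comment):

from typing import List, Union
import torch

class ExampleTokenizer(torch.nn.Module):
    # A single string maps to a list of tokens; a batch of strings maps to a
    # list of token lists. Concrete tokenizers fill in the body.
    def forward(self, input: Union[str, List[str]]) -> Union[List[str], List[List[str]]]:
        ...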
Now the resulting tokens produced by the tokenizer could be anything: real tokens or subword ids, depending on the underlying tokenizer, and users are free to interpret them however they like for downstream models. Let me know if this makes sense. Please feel free to share any ideas you have 😃
You can also try to use Sequential to compose the transform, as shown here for XLM-R text pre-processing. In the linked example, you would need to add a ToTensor transform so that you get a tensor instead of List[List[str]]. For VocabTransform (converting the string indices to the corresponding integer ids), you can construct the corresponding vocab object and pass it to the transform.
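A minimal sketch of one such composition, assuming the merges file from the example above; it uses StrToIntTransform in place of VocabTransform, since CLIPTokenizer already emits token ids as strings, and zero padding to mirror OpenAI's output:

import torchtext.transforms as T

# Compose tokenization, string-to-int conversion, and tensor conversion with padding.
text_transform = T.Sequential(
    T.CLIPTokenizer(merges_path="clip_bpe.txt"),  # token ids as strings, e.g. ['589', '533', ...]
    T.StrToIntTransform(),                        # string ids -> integer ids
    T.ToTensor(padding_value=0),                  # pad the batch and return a torch.Tensor
)

ids = text_transform(["This is a test", "Another longer example"])
print(ids.shape)  # [batch, max_tokens]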