
Add in-layer TF Tokenizer to BPE tokenizers


Feature request

Like what we already have with TFBertTokenizer, but for models that use Byte-Pair Encoding (e.g. TFT5Tokenizer, TFClipTokenizer, etc.).

They were implemented in keras-nlp (https://github.com/keras-team/keras-nlp/pull/389) and we can now bring them here.
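
For concreteness, here is a minimal sketch of what such an in-graph tokenizer layer could look like, built on keras-nlp's BytePairTokenizer. The class name TFGPT2TokenizerLayer and the way it would be wired into transformers are hypothetical; only the BytePairTokenizer constructor signature is taken from the keras-nlp docs (quoted in the web results below).

```python
# Minimal sketch, NOT the actual transformers API: wrap keras-nlp's
# BytePairTokenizer in a Keras layer so that BPE tokenization runs inside
# the TF graph, analogous to what TFBertTokenizer does for WordPiece.
# The class name is a hypothetical placeholder.
import tensorflow as tf
import keras_nlp


class TFGPT2TokenizerLayer(tf.keras.layers.Layer):
    def __init__(self, vocabulary, merges, sequence_length=None, **kwargs):
        super().__init__(**kwargs)
        # Constructor signature taken from keras_nlp.tokenizers.BytePairTokenizer.
        self.bpe = keras_nlp.tokenizers.BytePairTokenizer(
            vocabulary=vocabulary,
            merges=merges,
            sequence_length=sequence_length,
        )

    def call(self, texts):
        # texts: string tensor of shape (batch,) -> integer token ids.
        return self.bpe(texts)
```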

Motivation

With that feature we would be able to serve almost every model with TF Serving, which would make deployment much easier: tokenization would happen inside the exported graph, so we wouldn't have to write custom handlers and servers.

Having TF BPE tokenizers is (I think) the last barrier to making transformers fully TF Serving-compliant.
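
To illustrate the deployment story, here is a hedged sketch of exporting a single SavedModel that TF Serving can host directly, under the assumption that an in-graph BPE tokenizer layer exists; `bpe_layer` and `tf_model` are hypothetical stand-ins, not real transformers objects.

```python
# Sketch of the end goal: one SavedModel, hosted by TF Serving as-is,
# with tokenization in-graph so no separate preprocessing server is needed.
# `bpe_layer` and `tf_model` are hypothetical stand-ins.
import tensorflow as tf


class ServableModel(tf.Module):
    def __init__(self, bpe_layer, tf_model):
        super().__init__()
        self.tokenizer = bpe_layer
        self.model = tf_model

    @tf.function(input_signature=[tf.TensorSpec([None], tf.string)])
    def serve(self, texts):
        token_ids = self.tokenizer(texts)  # raw strings -> token ids, in-graph
        return self.model(token_ids)       # token ids -> model outputs


# servable = ServableModel(bpe_layer, tf_model)
# tf.saved_model.save(
#     servable,
#     "export/model/1",
#     signatures={"serving_default": servable.serve},
# )
```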

Your contribution

I can submit a PR, but there is a huge number of models for which we would need to do this, so I expect a large number of subtasks if you decide to go for it.

Also, since keras-nlp already implemented it (https://github.com/keras-team/keras-nlp/pull/389), should we copy the code into each tokenizer, or import it from keras-nlp while keeping the reference to their repo?

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 24 (21 by maintainers)

Top GitHub Comments

1 reaction
Rocketknight1 commented, Nov 21, 2022

Don’t apologize at all - this is something we were struggling with too!

1 reaction
piEsposito commented, Nov 16, 2022

Yeah, I'm exploring it and, guess what, it is not as easy as I thought, haha.


Top Results From Across the Web

  • Summary of the tokenizers - Hugging Face
    More specifically, we will look at the three main types of tokenizers used in Transformers: Byte-Pair Encoding (BPE), WordPiece, and SentencePiece, and show …
  • Tokenizing with TF Text - TensorFlow
    Tokenization is the process of breaking up a string into tokens. Commonly, these tokens are words, numbers, and/or punctuation.
  • BytePairTokenizer - Keras
    keras_nlp.tokenizers.BytePairTokenizer(vocabulary, merges, sequence_length=None, **kwargs). Byte-pair encoding tokenizer layer. This BPE tokenizer … (A toy usage sketch follows this list.)
  • NLG with GPT-2 - Jake Tae
    BPE is a way of splitting up words to apply tokenization. … are using TensorFlow, simply replace the argument with return_tensors="tf".
  • BPE vs WordPiece Tokenization - when to use / which?
    This pair is added to the vocab and the language model is again trained on the new vocab. These steps are repeated until …
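
As a toy illustration of the BytePairTokenizer signature quoted above (the two-entry vocabulary and merge rules are invented for the example, not taken from a real model):

```python
# Toy usage of keras_nlp.tokenizers.BytePairTokenizer; the vocabulary and
# merges below are made up for illustration. Real models would load these
# from their vocab.json / merges.txt files.
import keras_nlp

vocab = {"butter": 1, "fly": 2}
merges = ["b u", "t t", "e r", "bu tt", "butt er", "f l", "fl y"]

tokenizer = keras_nlp.tokenizers.BytePairTokenizer(vocabulary=vocab, merges=merges)
print(tokenizer("butterfly"))  # expected: token ids [1, 2], i.e. "butter" + "fly"
```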
