
Add in-layer TF Tokenizer to BPE tokenizers


Feature request

Like what we already have with TFBertTokenizer, but for models that use Byte-Pair Encoding (e.g. TFT5Tokenizer, TFClipTokenizer, etc.).

They were implemented in keras-nlp (https://github.com/keras-team/keras-nlp/pull/389) and we can now bring them here.
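
For concreteness, here is a minimal sketch of what such an in-graph tokenizer layer could look like, built on keras-nlp's BytePairTokenizer. The class name TFGPT2TokenizerLayer and the way it would be wired into transformers are hypothetical; only the BytePairTokenizer constructor signature is taken from the keras-nlp docs (quoted in the web results below).

```python
# Minimal sketch, NOT the actual transformers API: wrap keras-nlp's
# BytePairTokenizer in a Keras layer so that BPE tokenization runs inside
# the TF graph, analogous to what TFBertTokenizer does for WordPiece.
# The class name is a hypothetical placeholder.
import tensorflow as tf
import keras_nlp


class TFGPT2TokenizerLayer(tf.keras.layers.Layer):
    def __init__(self, vocabulary, merges, sequence_length=None, **kwargs):
        super().__init__(**kwargs)
        # Constructor signature taken from keras_nlp.tokenizers.BytePairTokenizer.
        self.bpe = keras_nlp.tokenizers.BytePairTokenizer(
            vocabulary=vocabulary,
            merges=merges,
            sequence_length=sequence_length,
        )

    def call(self, texts):
        # texts: string tensor of shape (batch,) -> integer token ids.
        return self.bpe(texts)
```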

Motivation

With that feature we would be able to serve almost every model with TF Serving, which would make deployment much easier: tokenization would happen inside the exported graph, so we wouldn't have to write custom handlers and servers.

Having TF BPE tokenizers is (I think) the last barrier to making transformers fully TF Serving-compliant.
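
To illustrate the deployment story, here is a hedged sketch of exporting a single SavedModel that TF Serving can host directly, under the assumption that an in-graph BPE tokenizer layer exists; `bpe_layer` and `tf_model` are hypothetical stand-ins, not real transformers objects.

```python
# Sketch of the end goal: one SavedModel, hosted by TF Serving as-is,
# with tokenization in-graph so no separate preprocessing server is needed.
# `bpe_layer` and `tf_model` are hypothetical stand-ins.
import tensorflow as tf


class ServableModel(tf.Module):
    def __init__(self, bpe_layer, tf_model):
        super().__init__()
        self.tokenizer = bpe_layer
        self.model = tf_model

    @tf.function(input_signature=[tf.TensorSpec([None], tf.string)])
    def serve(self, texts):
        token_ids = self.tokenizer(texts)  # raw strings -> token ids, in-graph
        return self.model(token_ids)       # token ids -> model outputs


# servable = ServableModel(bpe_layer, tf_model)
# tf.saved_model.save(
#     servable,
#     "export/model/1",
#     signatures={"serving_default": servable.serve},
# )
```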

Your contribution

I can submit a PR, but there is a huge number of models for which we would need to do this, so I expect a large number of subtasks if you decide to go for it.

Also, since keras-nlp already implemented it (https://github.com/keras-team/keras-nlp/pull/389), should we copy the code into each tokenizer, or import it from keras-nlp while keeping the reference to their repo?

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 24 (21 by maintainers)

Top GitHub Comments

1 reaction
Rocketknight1 commented, Nov 21, 2022

Don’t apologize at all - this is something we were struggling with too!

1 reaction
piEsposito commented, Nov 16, 2022

Yeah, I'm exploring it and, guess what, it is not as easy as I thought, haha.


Top Results From Across the Web

  • Summary of the tokenizers - Hugging Face
    More specifically, we will look at the three main types of tokenizers used in Transformers: Byte-Pair Encoding (BPE), WordPiece, and SentencePiece, and show …
  • Tokenizing with TF Text - TensorFlow
    Tokenization is the process of breaking up a string into tokens. Commonly, these tokens are words, numbers, and/or punctuation.
  • BytePairTokenizer - Keras
    keras_nlp.tokenizers.BytePairTokenizer(vocabulary, merges, sequence_length=None, **kwargs). Byte-pair encoding tokenizer layer. This BPE tokenizer … (A toy usage sketch follows this list.)
  • NLG with GPT-2 - Jake Tae
    BPE is a way of splitting up words to apply tokenization. … are using TensorFlow, simply replace the argument with return_tensors="tf".
  • BPE vs WordPiece Tokenization - when to use / which?
    This pair is added to the vocab and the language model is again trained on the new vocab. These steps are repeated until …
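
As a toy illustration of the BytePairTokenizer signature quoted above (the two-entry vocabulary and merge rules are invented for the example, not taken from a real model):

```python
# Toy usage of keras_nlp.tokenizers.BytePairTokenizer; the vocabulary and
# merges below are made up for illustration. Real models would load these
# from their vocab.json / merges.txt files.
import keras_nlp

vocab = {"butter": 1, "fly": 2}
merges = ["b u", "t t", "e r", "bu tt", "butt er", "f l", "fl y"]

tokenizer = keras_nlp.tokenizers.BytePairTokenizer(vocabulary=vocab, merges=merges)
print(tokenizer("butterfly"))  # expected: token ids [1, 2], i.e. "butter" + "fly"
```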
