Add a vocabulary_size argument to WordPieceTokenizer
We should add a `vocabulary_size` argument to the `WordPieceTokenizer` layer that forces the vocabulary size by truncating the passed-in vocabulary if necessary.
Potential docstring:
vocabulary_size: Force the vocabulary to be exactly `vocabulary_size`,
by truncating the input vocabulary if necessary. This is not
equivalent to retraining a word piece vocabulary from scratch, but
can be useful for quick hyperparameter tuning.
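Until such an argument lands, here is a minimal sketch of how the proposed behavior could be approximated outside the layer, assuming KerasNLP's `keras_nlp.tokenizers.WordPieceTokenizer` accepts the vocabulary as a list of strings. The `make_truncated_tokenizer` helper is hypothetical, for illustration only:

```python
import keras_nlp

def make_truncated_tokenizer(vocabulary, vocabulary_size=None, **kwargs):
    """Build a WordPieceTokenizer with at most `vocabulary_size` tokens.

    Mimics the proposed argument by truncating the vocabulary list before
    it reaches the layer. Not equivalent to retraining WordPiece.
    """
    if vocabulary_size is not None and len(vocabulary) > vocabulary_size:
        # WordPiece vocabularies are conventionally ordered with special
        # tokens and the most frequent subwords first, so dropping the
        # tail discards the rarest pieces.
        vocabulary = vocabulary[:vocabulary_size]
    return keras_nlp.tokenizers.WordPieceTokenizer(
        vocabulary=vocabulary, **kwargs
    )

# Example usage with a toy vocabulary (must include the OOV token "[UNK]"):
vocab = ["[PAD]", "[UNK]", "the", "qu", "##ick", "br", "##own", "fox"]
tokenizer = make_truncated_tokenizer(vocab, vocabulary_size=6)
```

Truncating from the tail assumes the vocabulary is frequency-ordered, which is the convention for WordPiece vocabularies produced by the standard learners.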
Issue Analytics
- Created a year ago
- Comments: 9 (8 by maintainers)
Top GitHub Comments
Thank you!
Hey @blackhat-coder, are you still working on this?