How can I generate a sentencepiece file or vocabulary from tokenizers?
After I make a custom tokenizer using the Tokenizers library, I can load it into XLNetTokenizerFast using:
from tokenizers import Tokenizer
from transformers import XLNetTokenizerFast

tokenizer = Tokenizer.from_file("unigram.json")
tok = XLNetTokenizerFast(tokenizer_object=tokenizer)
After I call
tok.save_vocabulary("ss")
it throws an error, since I didn't load XLNetTokenizerFast from an spm file. I believe save_vocabulary is looking for the vocab_file parameter.
Is there any way to call save_vocabulary after loading the tokenizer into XLNetTokenizerFast this way?
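As a stopgap (not part of the original question), the vocabulary can at least be dumped by hand, since get_vocab() works on the fast tokenizer even without an spm file. A minimal sketch, where vocab.txt is a hypothetical output path; note it produces only a token list, not a real sentencepiece .model file:

# Workaround sketch: write the fast tokenizer's vocabulary to a text file,
# one token per line, ordered by token id.
vocab = tok.get_vocab()  # dict mapping token -> id
with open("vocab.txt", "w", encoding="utf-8") as f:
    for token, _id in sorted(vocab.items(), key=lambda kv: kv[1]):
        f.write(token + "\n")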
Issue Analytics
- State:
- Created 2 years ago
- Comments: 14 (7 by maintainers)
Top Results From Across the Web
Summary of the tokenizers - Hugging Face
Converting words or subwords to ids is straightforward, so in this summary, we will focus on splitting a text into words or subwords...

sentencepiece: Text Tokenization using Byte Pair Encoding ...
Description: Unsupervised text tokenizer allowing to perform byte pair encoding and unigram modelling. Wraps the 'sentencepiece' library <https: ...

SentencePiece Tokenizer Demystified | by Jonathan Kernes
Instead, we loop once at the start to find all words (not subwords, actual words) and create vocabulary, which is a dictionary matching...

speechbrain.tokenizers.SentencePiece module - Read the Docs
BPE class call the SentencePiece unsupervised text tokenizer from Google. ... file which is used for checking the accuracy of recovering words from...

Construct a Sentencepiece model - Rdrr.io
In sentencepiece: Text Tokenization using Byte Pair Encoding and Unigram ... a character vector of path(s) to the text files containing ...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Thank you very much for reporting this problem @darwinharianto.
I have the impression that it is linked to the problem reported in this issue, which I had started to work on in this PR. As the cleanest-looking fix requires quite a bit of work, I had put it on hold. I'll try to work on it again at the beginning of the week.
Duly noted! If it's to do a sanity check, would it be OK to compare the files of the fast version of the tokenizers (in particular the tokenizer.json file)?
To retrieve these files for the new tokenizer tok you made, the following command should work:
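A minimal sketch of such a command, assuming the standard save_pretrained API of transformers fast tokenizers; the directory name is a hypothetical example:

# Writes tokenizer.json, tokenizer_config.json and special_tokens_map.json
# into the given directory.
tok.save_pretrained("my_tokenizer")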
For information, the vocabulary will be visible in the tokenizer.json file.
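For example (an assumption, not from the original comment): for a Unigram model, the "model" section of tokenizer.json stores the vocabulary as [token, score] pairs, so it can be inspected like this (the path reuses the hypothetical my_tokenizer directory from above):

import json

# Peek at the first few vocabulary entries serialized in tokenizer.json
# (Unigram layout: data["model"]["vocab"] is a list of [token, score] pairs).
with open("my_tokenizer/tokenizer.json", encoding="utf-8") as f:
    data = json.load(f)
for token, score in data["model"]["vocab"][:10]:
    print(token, score)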