
How can I generate sentencepiece file or vocabulary from tokenizers?


After building a custom tokenizer with the Tokenizers library, I could load it into XLNetTokenizerFast using

from tokenizers import Tokenizer
from transformers import XLNetTokenizerFast

tokenizer = Tokenizer.from_file("unigram.json")
tok = XLNetTokenizerFast(tokenizer_object=tokenizer)
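
For context, a file like unigram.json can be produced with the Tokenizers library itself. Here is a minimal training sketch; the corpus path, vocabulary size, and special-token list are illustrative assumptions, not taken from the issue:

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Train a Unigram model from scratch; "corpus.txt" is a placeholder path.
tokenizer = Tokenizer(models.Unigram())
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()
trainer = trainers.UnigramTrainer(
    vocab_size=8000,
    special_tokens=["<s>", "</s>", "<unk>", "<sep>", "<cls>", "<pad>"],
    unk_token="<unk>",
)
tokenizer.train(["corpus.txt"], trainer)
tokenizer.save("unigram.json")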

After I called

tok.save_vocabulary("ss")

it throws an error, since I didn't load XLNetTokenizerFast from an spm file. I believe save_vocabulary is looking for the vocab_file parameter.

Is there any way to save the vocabulary after loading the tokenizer into XLNetTokenizerFast this way?
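
A minimal reproduction of the failure, assuming the unigram.json produced above (the exact exception type and message vary across transformers versions):

from tokenizers import Tokenizer
from transformers import XLNetTokenizerFast

tokenizer = Tokenizer.from_file("unigram.json")
tok = XLNetTokenizerFast(tokenizer_object=tokenizer)

# save_vocabulary tries to copy the original sentencepiece model file
# (self.vocab_file), which was never set here because the tokenizer was
# built from a tokenizer_object rather than from an .spm file.
try:
    tok.save_vocabulary("ss")
except Exception as exc:
    print(type(exc).__name__, exc)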

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 14 (7 by maintainers)

Top GitHub Comments

2 reactions
SaulLu commented, Aug 16, 2021

Thank you very much for reporting this problem @darwinharianto.

I have the impression that it is linked to the problem reported in this issue, and that I had started to work on it in this PR. As the cleanest-looking fix requires quite a bit of work, I had put it on hold. I'll try to work on it again at the beginning of the week.

1 reaction
SaulLu commented, Aug 17, 2021

Duly noted! If the goal is a sanity check, would it be OK to compare the files of the fast version of the tokenizers (in particular the tokenizer.json file)?

To retrieve these files for the new tokenizer tok you made, the following command should work:

tok.save_pretrained("./dump", legacy_format=False)

For reference, the vocabulary will be visible in the tokenizer.json file.
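
As a sketch of the suggested check, one could save the fast-tokenizer files, reload them, and confirm both tokenizers agree. Here tok is the tokenizer built in the question, "./dump" matches the command above, and the sample sentence is arbitrary:

from transformers import XLNetTokenizerFast

# Save only the fast-tokenizer files (tokenizer.json and friends), then
# reload from the dump directory and compare the produced ids.
tok.save_pretrained("./dump", legacy_format=False)
reloaded = XLNetTokenizerFast.from_pretrained("./dump")

sample = "How can I generate a sentencepiece vocabulary?"
assert tok(sample)["input_ids"] == reloaded(sample)["input_ids"]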


Top Results From Across the Web

  • Summary of the tokenizers - Hugging Face
    Converting words or subwords to ids is straightforward, so in this summary, we will focus on splitting a text into words or subwords...
  • sentencepiece: Text Tokenization using Byte Pair Encoding ...
    Description: Unsupervised text tokenizer allowing to perform byte pair encoding and unigram modelling. Wraps the 'sentencepiece' library <https: ...
  • SentencePiece Tokenizer Demystified | by Jonathan Kernes
    Instead, we loop once at the start to find all words (not subwords, actual words) and create vocabulary, which is a dictionary matching...
  • speechbrain.tokenizers.SentencePiece module - Read the Docs
    BPE class call the SentencePiece unsupervised text tokenizer from Google. ... file which is used for checking the accuracy of recovering words from...
  • Construct a Sentencepiece model - Rdrr.io
    In sentencepiece: Text Tokenization using Byte Pair Encoding and Unigram ... a character vector of path(s) to the text files containing ...
