
How can I generate sentencepiece file or vocabulary from tokenizers?


After building a custom tokenizer with the Tokenizers library, I could load it into XLNetTokenizerFast using

from tokenizers import Tokenizer
from transformers import XLNetTokenizerFast

tokenizer = Tokenizer.from_file("unigram.json")
tok = XLNetTokenizerFast(tokenizer_object=tokenizer)
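
For context, a file like unigram.json can be produced with the Tokenizers library itself. Here is a minimal training sketch; the corpus path, vocabulary size, and special-token list are illustrative assumptions, not taken from the issue:

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Train a Unigram model from scratch; "corpus.txt" is a placeholder path.
tokenizer = Tokenizer(models.Unigram())
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()
trainer = trainers.UnigramTrainer(
    vocab_size=8000,
    special_tokens=["<s>", "</s>", "<unk>", "<sep>", "<cls>", "<pad>"],
    unk_token="<unk>",
)
tokenizer.train(["corpus.txt"], trainer)
tokenizer.save("unigram.json")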

After I called

tok.save_vocabulary("ss")

it throws an error, since I didn't load XLNetTokenizerFast from an spm file. I believe save_vocabulary is looking for the vocab_file parameter.

Is there any way to save the vocabulary after loading the tokenizer into XLNetTokenizerFast this way?
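
A minimal reproduction of the failure, assuming the unigram.json produced above (the exact exception type and message vary across transformers versions):

from tokenizers import Tokenizer
from transformers import XLNetTokenizerFast

tokenizer = Tokenizer.from_file("unigram.json")
tok = XLNetTokenizerFast(tokenizer_object=tokenizer)

# save_vocabulary tries to copy the original sentencepiece model file
# (self.vocab_file), which was never set here because the tokenizer was
# built from a tokenizer_object rather than from an .spm file.
try:
    tok.save_vocabulary("ss")
except Exception as exc:
    print(type(exc).__name__, exc)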

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 14 (7 by maintainers)

Top GitHub Comments

2 reactions
SaulLu commented, Aug 16, 2021

Thank you very much for reporting this problem @darwinharianto.

I have the impression that it is linked to the problem reported in this issue, and that I had started to work on it in this PR. As the cleanest-looking fix requires quite a bit of work, I had put it on hold. I'll try to work on it again at the beginning of the week.

1 reaction
SaulLu commented, Aug 17, 2021

Duly noted! If the goal is a sanity check, would it be OK to compare the files of the fast version of the tokenizers (in particular the tokenizer.json file)?

To retrieve these files for the new tokenizer tok you made, the following command should work:

tok.save_pretrained("./dump", legacy_format=False)

For reference, the vocabulary will be visible in the tokenizer.json file.
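
As a sketch of the suggested check, one could save the fast-tokenizer files, reload them, and confirm both tokenizers agree. Here tok is the tokenizer built in the question, "./dump" matches the command above, and the sample sentence is arbitrary:

from transformers import XLNetTokenizerFast

# Save only the fast-tokenizer files (tokenizer.json and friends), then
# reload from the dump directory and compare the produced ids.
tok.save_pretrained("./dump", legacy_format=False)
reloaded = XLNetTokenizerFast.from_pretrained("./dump")

sample = "How can I generate a sentencepiece vocabulary?"
assert tok(sample)["input_ids"] == reloaded(sample)["input_ids"]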


Top Results From Across the Web

  • Summary of the tokenizers - Hugging Face
    Converting words or subwords to ids is straightforward, so in this summary, we will focus on splitting a text into words or subwords...
  • sentencepiece: Text Tokenization using Byte Pair Encoding ...
    Description: Unsupervised text tokenizer allowing to perform byte pair encoding and unigram modelling. Wraps the 'sentencepiece' library <https: ...
  • SentencePiece Tokenizer Demystified | by Jonathan Kernes
    Instead, we loop once at the start to find all words (not subwords, actual words) and create vocabulary, which is a dictionary matching...
  • speechbrain.tokenizers.SentencePiece module - Read the Docs
    BPE class call the SentencePiece unsupervised text tokenizer from Google. ... file which is used for checking the accuracy of recovering words from...
  • Construct a Sentencepiece model - Rdrr.io
    In sentencepiece: Text Tokenization using Byte Pair Encoding and Unigram ... a character vector of path(s) to the text files containing ...
