The purpose of the files merges.txt, special_tokens_map.json, training_args.bin and add_tokens.json
See original GitHub issue

Good evening!
After I have pre-trained my RoBERTa model, I get the list of the following files:
- `merges.txt`
- `special_tokens_map.json`
- `training_args.bin`

I have also seen that if you add extra tokens to the tokenizer, the file `add_tokens.json` appears. Could I ask you to clarify the meaning of the first three files: what they contain and how they are used? And also, how can I add extra tokens when pre-training RoBERTa or any BERT-type model? A million thanks in advance!
Be safe, Akim
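As a minimal sketch of the extra-tokens part of the question (not from the thread; it assumes the `transformers` API, and the new tokens below are hypothetical), extending a pretrained RoBERTa tokenizer and resizing the model's embeddings to match could look like this:

```python
from transformers import RobertaForMaskedLM, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")

# Hypothetical domain-specific tokens
num_added = tokenizer.add_tokens(["<gene>", "<protein>"])

# The embedding matrix must grow to cover the new token ids
model.resize_token_embeddings(len(tokenizer))

# Saving the tokenizer now also writes the list of added tokens
# (added_tokens.json) alongside vocab.json, merges.txt and
# special_tokens_map.json.
tokenizer.save_pretrained("roberta-extended")
```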
Issue Analytics
- State:
- Created 3 years ago
- Reactions: 5
- Comments: 6 (5 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
First of all, as for GPT2, the Hugging Face (HF) tokenizer of RoBERTa is a byte-level Byte-Pair Encoding (BBPE) tokenizer, as written in the documentation.
Then, we can check in this page that the attribute `vocab_files_names` lists 2 files: `vocab.json` and `merges.txt`.

Let's open `merges.txt` of RoBERTa-base, for instance. The file starts like this:
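From memory (decoding the file as UTF-8, and assuming the standard GPT-2-style merges format that `roberta-base` shares), the top of `merges.txt` looks roughly like this: a `#version` header, then one merge rule per line, most frequent pair first:

```
#version: 0.2
Ġ t
Ġ a
h e
i n
r e
o n
```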
Note: in this RoBERTa tokenizer merge file, the special character `Ä` appears to be used for encoding the space, instead of the `Ġ` that is used by the GPT2 tokenizer (explanation 1 and explanation 2), yet in the corresponding RoBERTa vocab file, the character `Ġ` is used. I do not know why (a possible explanation is sketched below).

The merge file shows which tokens will be merged at each iteration (that's why there is a space between the two tokens on each line of the merge file).
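A plausible reason for the `Ä`/`Ġ` discrepancy (an assumption on my part, not confirmed in the thread): `Ġ` is U+0120, which UTF-8 encodes as the bytes `0xC4 0xA0`; if the merge file is opened as Latin-1 instead of UTF-8, those two bytes render as `Ä` followed by a non-breaking space. A minimal Python check:

```python
# Assumption: the "Ä" is an encoding artifact, not a different marker.
marker = "\u0120"                # Ġ, the byte-level BPE marker for a space
raw = marker.encode("utf-8")     # b'\xc4\xa0'
misread = raw.decode("latin-1")  # 'Ä\xa0' -- shows up as 'Ä ' in an editor
print(raw, repr(misread))
```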
About your example: it means that at the third iteration, the pair `he`, formed by the 2 tokens `h` and `e`, is the most frequent in the corpus (the token `he` without a space before the `h`).

If, at the end of the iterations, there is at least one pair `he` left (not merged with other tokens), it will appear in the vocab file (this also depends on the `min_freq` rules and the number of tokens in the vocab). Here, the id of `he` in the vocab file is 700.

Hope it helps, but it would be great to get the point of view of someone from Hugging Face like @sshleifer or @sgugger.
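As a quick check of that id (assuming the `transformers` library and the `roberta-base` checkpoint):

```python
from transformers import RobertaTokenizer

tok = RobertaTokenizer.from_pretrained("roberta-base")
# The comment above reports 700 for the token "he" in roberta-base's vocab
print(tok.convert_tokens_to_ids("he"))
```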
My understanding is that the file `merges.txt` is built during the training of the BBPE (byte-level BPE) tokenizer on the corpus: it gets a new entry (line) at each iteration in which the tokenizer finds the most frequent byte pair.

For example, the first line can be `Ġ d`. Why? Because at the first iteration, the most frequent token is `d` with a space in front of it, and the character `Ġ` means space. What is the consequence in the vocabulary? The token `Ġd` is listed.

Hope I'm right. If not, please give me your explanation, as I have not found any online.
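To make the training step concrete, here is a minimal sketch (the corpus path and output directory are hypothetical) of producing a `merges.txt` and `vocab.json` by training a byte-level BPE tokenizer with the `tokenizers` library, as is done for RoBERTa:

```python
import os

from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corpus.txt"],  # hypothetical plain-text training corpus
    vocab_size=50_265,     # roberta-base's vocabulary size
    min_frequency=2,       # pairs rarer than this are never merged
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Writes vocab.json and merges.txt; each line of merges.txt records one
# merge rule, in the order the pairs were learned.
os.makedirs("my-tokenizer", exist_ok=True)
tokenizer.save_model("my-tokenizer")
```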