
The purpose of files merges.txt, special_tokens_map.json, training_args.bin and add_tokens.json

See original GitHub issue

Good evening!

After I have pre-trained my RoBERTa model, I get the following files: merges.txt, special_tokens_map.json, training_args.bin. I have also seen that if you add extra tokens to the tokenizer, the file add_tokens.json appears. Could you clarify the meaning of the first three files - how they are used and what they contain? And also, how can I add extra tokens when pre-training RoBERTa or any BERT-type model? A million thanks in advance!

Be safe, Akim
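
On the last question, the thread below focuses on merges.txt, so here is a minimal sketch of how extra tokens are usually added with the standard transformers API (the token strings and save path are placeholders, not from the issue):

from transformers import RobertaForMaskedLM, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")

# Register extra (non-special) tokens; add_special_tokens() is the analogue
# for things like an additional separator or mask token.
num_added = tokenizer.add_tokens(["covid19", "mytoken"])

# The embedding matrix must grow to cover the new vocabulary entries.
model.resize_token_embeddings(len(tokenizer))

# Saving the tokenizer now also writes added_tokens.json
# next to vocab.json, merges.txt and special_tokens_map.json.
tokenizer.save_pretrained("./roberta-extended")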

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 5
  • Comments: 6 (5 by maintainers)

Top GitHub Comments

15 reactions
piegu commented, Jun 20, 2020

First of all, like GPT-2, the Hugging Face (HF) tokenizer of RoBERTa uses Byte-level Byte-Pair Encoding (BBPE), as written in the documentation.

Then, we can check on this page that the attribute vocab_files_names lists 2 files:

VOCAB_FILES_NAMES = {
    "vocab_file": "vocab.json",
    "merges_file": "merges.txt",
}
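
To see this for a concrete checkpoint, a short illustrative check (assuming roberta-base):

from transformers import RobertaTokenizer

tok = RobertaTokenizer.from_pretrained("roberta-base")
print(tok.vocab_files_names)
# {'vocab_file': 'vocab.json', 'merges_file': 'merges.txt'}

# Saving the tokenizer writes these two files, plus special_tokens_map.json
# and tokenizer_config.json (and added_tokens.json if tokens were added).
tok.save_pretrained("./roberta-base-tokenizer")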

Let’s open merges.txt of RoBERTa-base, for instance. The file starts like this:

#version: 0.2
Ä  t
Ä  a
h e
i n
r e
o n
Ä t he
e r
Ä  s
a t
Ä  w
Ä  o
...

Note: in this RoBERTa tokenizer merges file, the special character Ä appears where the GPT-2 tokenizer uses Ġ to encode a space (explanation 1 and explanation 2), but in the corresponding RoBERTa vocab file the character Ġ is used. I do not know why.
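
A likely reason, offered here as an assumption rather than a confirmed fact: Ġ is U+0120, whose UTF-8 bytes are 0xC4 0xA0; if merges.txt is viewed as Latin-1, those bytes render as Ä followed by a non-breaking space, so both files presumably contain Ġ and the Ä is only a display-encoding artifact. A quick check:

# "Ġ" (U+0120) round-tripped through the wrong encoding:
g = "\u0120"
print(g.encode("utf-8"))                    # b'\xc4\xa0'
print(g.encode("utf-8").decode("latin-1"))  # 'Ä\xa0', which displays as "Ä "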

The merges file shows which tokens will be merged at each iteration (that's why there is a space between the two tokens on each line of the merges file).

About your example: it means that at the third iteration (the line h e), the pair he, formed from the two tokens h and e, is the most frequent in the corpus (the token he with no space before the h).
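
To make the iteration idea concrete, here is a toy sketch (not the Hugging Face implementation, and ignoring byte-level details) of how BPE counts adjacent pairs, merges the most frequent one, and writes one merges.txt line per iteration:

from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs over a tiny corpus of tokenised words."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(pair, words):
    """Replace every occurrence of `pair` by the merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Tiny corpus: word -> frequency, each word pre-split into characters.
words = {tuple("the"): 5, tuple("there"): 2, tuple("her"): 3}
merges = []
for _ in range(3):
    pair = most_frequent_pair(words)
    merges.append(" ".join(pair))   # one line of merges.txt per iteration
    words = merge_pair(pair, words)

print(merges)   # ['h e', 't he', 'he r']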

If, at the end of the iterations, at least one pair he is left (not merged into a larger token), it will appear in the vocab file (this also depends on the min_frequency rule and the number of tokens in the vocab). Here, the id of he in the vocab file is 700.
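
This is easy to verify locally (a sketch assuming roberta-base; ids can differ for other checkpoints):

from transformers import RobertaTokenizer

tok = RobertaTokenizer.from_pretrained("roberta-base")
print(tok.get_vocab().get("he"))         # id of the token "he" in vocab.json
print(tok.convert_tokens_to_ids("Ġhe"))  # the leading-space variant, if present in the vocab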

Hope it helps, but it would be great to get the point of view of someone from Hugging Face like @sshleifer or @sgugger.

9 reactions
piegu commented, Jun 20, 2020

My understanding is that the file merges.txt is built during the training of the BBPE (Byte-Level BPE) tokenizer on the corpus: it gets a new entry (line) at each iteration in which the tokenizer finds the most frequent byte pair.

For example, the first line can be Ġ d. Why? Because at the first iteration, the most frequent pair is a space followed by d, and the character Ġ encodes the space.

What is the consequence in the vocabulary? The token Ġd is listed.

Hope I’m right. If not, please give me your explanation as I have not found any online.
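
For context, training such a tokenizer is usually done with the separate tokenizers library; a minimal sketch (the corpus path and output directory are placeholders), showing the step that produces exactly vocab.json and merges.txt:

import os
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corpus.txt"],          # placeholder: one or more plaintext files
    vocab_size=50_265,
    min_frequency=2,               # the min_freq rule mentioned above
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
os.makedirs("my-roberta-tokenizer", exist_ok=True)
tokenizer.save_model("my-roberta-tokenizer")  # writes vocab.json and merges.txt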

Read more comments on GitHub >
