
Special tokens not tokenized properly

See original GitHub issue

Environment info

  • transformers version: 4.5.1
  • Python version: 3.8.5
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help

@LysandreJik

Information

Hi,

I have recently further pretrained a RoBERTa model with fairseq. I use a custom vocabulary, trained with the tokenizers module. After converting the fairseq model to PyTorch, I uploaded all my model-related files here.

When loading the tokenizer, I noticed that the special tokens are not tokenized properly.

To reproduce

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('manueltonneau/twibert-lowercase-50272')
tokenizer.tokenize('<mask>')
Out[7]: ['<mask>']
tokenizer.tokenize('<hashtag>')
Out[8]: ['hashtag']
tokenizer.encode('<hashtag>')
Out[3]: [0, 23958, 2]

Expected behavior

Since <hashtag> is a special token in the vocabulary with ID 7 (see here), the last output should be [0, 7, 2]. <hashtag>, including the ‘<’ and ‘>’, should also be recognized as a single token.
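
A quick way to see whether transformers is actually treating <hashtag> as a special token is to inspect the tokenizer's special-token attributes; the sketch below uses standard transformers API, with the token list copied from the training setup further down, and shows the usual workaround of registering the tokens explicitly. This is an assumption about the cause rather than a confirmed diagnosis.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('manueltonneau/twibert-lowercase-50272')

# If '<hashtag>' does not show up here, transformers is not treating it as
# special, even though it sits in the vocabulary with ID 7.
print(tokenizer.additional_special_tokens)
print(tokenizer.convert_tokens_to_ids('<hashtag>'))

# Registering the tokens explicitly usually makes tokenize()/encode() keep them intact.
tokenizer.add_special_tokens(
    {'additional_special_tokens': ['@USER', 'HTTPURL', '<hashtag>', '</hashtag>']}
)
print(tokenizer.encode('<hashtag>'))  # should now give [0, 7, 2] if the token sits at ID 7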

Potential explanation

When looking at the files of a similar model, it seems that its vocab is in txt format and that it also comes with a bpe.codes file, which I don’t have. Could that be the issue? And if so, how do I convert my files to this format?

For vocab.txt, I have already found your lengthy explanation here, thanks for this.
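
For what it's worth, vocab.txt plus bpe.codes is the fairseq-style layout used by tokenizers such as BERTweet's, while the RoBERTa tokenizer classes in transformers load a GPT-2-style vocab.json plus merges.txt, which is exactly what the BPE model's save() call from the tokenizers library writes. A quick check of what the class expects (the file paths below are placeholders):

from transformers import RobertaTokenizer, RobertaTokenizerFast

# The RoBERTa tokenizer classes expect a GPT-2/RoBERTa-style BPE layout:
print(RobertaTokenizer.vocab_files_names)
# {'vocab_file': 'vocab.json', 'merges_file': 'merges.txt'}

# So the files written by tokenizer.model.save(output_dir) should load directly,
# with no conversion to vocab.txt/bpe.codes needed:
tok = RobertaTokenizerFast(vocab_file='vocab.json', merges_file='merges.txt')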

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 7 (2 by maintainers)

Top GitHub Comments

1 reaction
manueltonneau commented, Jul 4, 2021

How did you add the additional special tokens? So you started from a pre-trained RoBERTa, added additional special tokens, and further pre-trained on a corpus?

I created a new vocab with the tokenizers module, to which I added new special tokens. Here is the code I used:

import os
import time

from tokenizers import Tokenizer, decoders, models, pre_tokenizers, processors, trainers

# args (vocab_size, corpus_dir, vocab_name) is a script argument object
# (e.g. from argparse), not shown here

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

# And then train
trainer = trainers.BpeTrainer(vocab_size=args.vocab_size, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
    "@USER",
    "HTTPURL",
    "<hashtag>",
    "</hashtag>"
    ], show_progress=True)
files = [os.path.join(args.corpus_dir, filename) for filename in os.listdir(args.corpus_dir)]
i = 0
start_time = time.time()
for file in files:
    print(f'Starting training on {file}')
    # note: each train() call fits the BPE model on just that file;
    # passing the full file list in a single call is the more usual pattern
    tokenizer.train([file], trainer=trainer)
    i = i + 1
    print(f'{i} files done out of {len(files)} files')
    print(f'Time elapsed: {time.time() - start_time} seconds')

# And save it
output_dir = f'/scratch/mt4493/twitter_labor/twitter-labor-data/data/pretraining/US/vocab_files/{args.vocab_size}/{args.vocab_name}'
if not os.path.exists(output_dir):
    os.makedirs(output_dir)
tokenizer.model.save(output_dir)
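
On the transformers side, the usual way to make these entries behave as special tokens is to wrap the trained tokenizers object in a PreTrainedTokenizerFast (available in recent transformers versions) and declare the special tokens there. The snippet below is only a sketch of that approach, reusing the tokenizer and output_dir from the code above; it is not necessarily the exact fix that resolved this thread.

from transformers import PreTrainedTokenizerFast

# Wrap the trained tokenizers.Tokenizer so transformers knows which
# vocabulary entries are special tokens.
hf_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    bos_token="<s>",
    eos_token="</s>",
    unk_token="<unk>",
    pad_token="<pad>",
    mask_token="<mask>",
    additional_special_tokens=["@USER", "HTTPURL", "<hashtag>", "</hashtag>"],
)

# Writes tokenizer.json, special_tokens_map.json and tokenizer_config.json,
# which AutoTokenizer.from_pretrained() can then load with the special tokens intact.
hf_tokenizer.save_pretrained(output_dir)
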
0 reactions
manueltonneau commented, Jul 5, 2021

Works fine, thanks again!
