
Special tokens not tokenized properly

See original GitHub issue

Environment info

  • transformers version: 4.5.1
  • Python version: 3.8.5
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help

@LysandreJik

Information

Hi,

I have recently further pretrained a RoBERTa model with fairseq. I use a custom vocabulary, trained with the tokenizers module. After converting the fairseq model to PyTorch, I uploaded all my model-related files here.

When loading the tokenizer, I noticed that the special tokens are not tokenized properly.

To reproduce

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('manueltonneau/twibert-lowercase-50272')
tokenizer.tokenize('<mask>')
Out[7]: ['<mask>']
tokenizer.tokenize('<hashtag>')
Out[8]: ['hashtag']
tokenizer.encode('<hashtag>')
Out[3]: [0, 23958, 2]

Expected behavior

Since <hashtag> is a special token in the vocabulary with ID 7 (see here), the last output should be [0, 7, 2]. <hashtag>, including the ‘<’ and ‘>’, should also be recognized as a single token.
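
A quick way to see whether transformers is actually treating <hashtag> as a special token is to inspect the tokenizer's special-token attributes; the sketch below uses standard transformers API, with the token list copied from the training setup further down, and shows the usual workaround of registering the tokens explicitly. This is an assumption about the cause rather than a confirmed diagnosis.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('manueltonneau/twibert-lowercase-50272')

# If '<hashtag>' does not show up here, transformers is not treating it as
# special, even though it sits in the vocabulary with ID 7.
print(tokenizer.additional_special_tokens)
print(tokenizer.convert_tokens_to_ids('<hashtag>'))

# Registering the tokens explicitly usually makes tokenize()/encode() keep them intact.
tokenizer.add_special_tokens(
    {'additional_special_tokens': ['@USER', 'HTTPURL', '<hashtag>', '</hashtag>']}
)
print(tokenizer.encode('<hashtag>'))  # should now give [0, 7, 2] if the token sits at ID 7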

Potential explanation

When looking at the files of a similar model, it seems that its vocab is in txt format and that it also comes with a bpe.codes file, which I don’t have. Could that be the issue? And if so, how do I convert my files to this format?

For vocab.txt, I have already found your lengthy explanation here, thanks for this.
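
For what it's worth, vocab.txt plus bpe.codes is the fairseq-style layout used by tokenizers such as BERTweet's, while the RoBERTa tokenizer classes in transformers load a GPT-2-style vocab.json plus merges.txt, which is exactly what the BPE model's save() call from the tokenizers library writes. A quick check of what the class expects (the file paths below are placeholders):

from transformers import RobertaTokenizer, RobertaTokenizerFast

# The RoBERTa tokenizer classes expect a GPT-2/RoBERTa-style BPE layout:
print(RobertaTokenizer.vocab_files_names)
# {'vocab_file': 'vocab.json', 'merges_file': 'merges.txt'}

# So the files written by tokenizer.model.save(output_dir) should load directly,
# with no conversion to vocab.txt/bpe.codes needed:
tok = RobertaTokenizerFast(vocab_file='vocab.json', merges_file='merges.txt')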

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 7 (2 by maintainers)

Top GitHub Comments

1 reaction
manueltonneau commented, Jul 4, 2021

How did you add the additional special tokens? So you started from a pre-trained RoBERTa, added additional special tokens, and further pre-trained on a corpus?

I created a new vocab with the tokenizers module, to which I added new special tokens. Here is the code I used:

import os
import time

from tokenizers import Tokenizer, decoders, models, pre_tokenizers, processors, trainers

# args (vocab_size, corpus_dir, vocab_name) is a script argument object
# (e.g. from argparse), not shown here

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

# And then train
trainer = trainers.BpeTrainer(vocab_size=args.vocab_size, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
    "@USER",
    "HTTPURL",
    "<hashtag>",
    "</hashtag>"
    ], show_progress=True)
files = [os.path.join(args.corpus_dir, filename) for filename in os.listdir(args.corpus_dir)]
i = 0
start_time = time.time()
for file in files:
    print(f'Starting training on {file}')
    # note: each train() call fits the BPE model on just that file;
    # passing the full file list in a single call is the more usual pattern
    tokenizer.train([file], trainer=trainer)
    i = i + 1
    print(f'{i} files done out of {len(files)} files')
    print(f'Time elapsed: {time.time() - start_time} seconds')

# And save it
output_dir = f'/scratch/mt4493/twitter_labor/twitter-labor-data/data/pretraining/US/vocab_files/{args.vocab_size}/{args.vocab_name}'
if not os.path.exists(output_dir):
    os.makedirs(output_dir)
tokenizer.model.save(output_dir)
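
On the transformers side, the usual way to make these entries behave as special tokens is to wrap the trained tokenizers object in a PreTrainedTokenizerFast (available in recent transformers versions) and declare the special tokens there. The snippet below is only a sketch of that approach, reusing the tokenizer and output_dir from the code above; it is not necessarily the exact fix that resolved this thread.

from transformers import PreTrainedTokenizerFast

# Wrap the trained tokenizers.Tokenizer so transformers knows which
# vocabulary entries are special tokens.
hf_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    bos_token="<s>",
    eos_token="</s>",
    unk_token="<unk>",
    pad_token="<pad>",
    mask_token="<mask>",
    additional_special_tokens=["@USER", "HTTPURL", "<hashtag>", "</hashtag>"],
)

# Writes tokenizer.json, special_tokens_map.json and tokenizer_config.json,
# which AutoTokenizer.from_pretrained() can then load with the special tokens intact.
hf_tokenizer.save_pretrained(output_dir)
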
0 reactions
manueltonneau commented, Jul 5, 2021

Works fine, thanks again!
