
Roberta-large using a BPE tokenizer generates multiple tokens.

See original GitHub issue

Roberta-large uses byte-level Byte-Pair-Encoding, which breaks the usual PET training setup, where each verbalization must map to a single token.

For example, the verbalization "Society" does not correspond to a single token; the tokenizer returns ['Soc', 'iety'].
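
For reference, here is a minimal sketch that reproduces this with the Hugging Face transformers tokenizer; the expected outputs in the comments follow the example above and the reply further down:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large")

# Without a leading space, the word is split into several byte-level BPE pieces.
print(tokenizer.tokenize("Society"))                 # ['Soc', 'iety']

# With a leading space it maps to the single word-initial token "ĠSociety".
print(tokenizer.tokenize(" Society"))                # ['ĠSociety']
print(tokenizer.convert_tokens_to_ids("ĠSociety"))   # 3930, per the reply below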

For now I have just commented out the assertion assert len(ids) == 1 in utils.py so that the first token id is used.

But I don’t know whether this will affect accuracy. Is there an alternative, since PET uses Roberta-large by default?

Thanks~

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 6

Top GitHub Comments

2 reactions
huchinlp commented, Oct 15, 2022

Hi,

GPT-2 and Roberta tokenizers will recognize the space before a word and replace it with a “Ġ”. Actually, “Society” is not a token in the vocab but “ĠSociety” is a valid one. You can call tokenizer.convert_tokens_to_ids("ĠSociety") and the result is 3930.

The only thing you need to do is replace “tokenizer.encode(xxxxx)” with the following lines:

# Fall back to the word-initial form prefixed with "Ġ" (the byte-level BPE
# marker for a preceding space) when the bare word is not in the vocabulary.
if tokenizer.convert_tokens_to_ids(word) == tokenizer.unk_token_id:
    space_word = "Ġ" + word
    token_id = tokenizer.convert_tokens_to_ids(space_word)
else:
    token_id = tokenizer.convert_tokens_to_ids(word)

Refer to this thread for more details: https://discuss.huggingface.co/t/bpe-tokenizers-and-spaces-before-words/475?u=joaogante

Best.
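
As a quick usage sketch (the helper name is illustrative, not something PET defines), the fallback above can be wrapped like this:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large")

# Hypothetical helper: map a verbalization to a single vocab id, falling back to
# the "Ġ"-prefixed word-initial form when the bare form is not in the vocabulary.
def verbalizer_token_id(tokenizer, word):
    token_id = tokenizer.convert_tokens_to_ids(word)
    if token_id == tokenizer.unk_token_id:
        token_id = tokenizer.convert_tokens_to_ids("Ġ" + word)
    return token_id

print(verbalizer_token_id(tokenizer, "Society"))  # 3930, per the comment above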

0 reactions
nieallen commented, Nov 29, 2022

Hi, how do I train a PET model with xlm-roberta, which uses byte-level Byte-Pair-Encoding?
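
The thread does not answer this, but as a hedged sketch: XLM-RoBERTa ships a SentencePiece vocabulary in which word-initial pieces are marked with "▁" rather than "Ġ", so the same fallback idea can be tried; whether a given verbalization ends up as a single piece still has to be checked per word.

from transformers import AutoTokenizer

xlmr_tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")

# Same idea as the RoBERTa fix above, but with SentencePiece's "▁" space marker.
# Any word that still maps to unk_token_id is not a single piece and would need
# to be replaced or handled differently.
def single_piece_id(tokenizer, word, space_marker="▁"):
    token_id = tokenizer.convert_tokens_to_ids(word)
    if token_id == tokenizer.unk_token_id:
        token_id = tokenizer.convert_tokens_to_ids(space_marker + word)
    return token_id

print(single_piece_id(xlmr_tokenizer, "Society"))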


Top Results From Across the Web

RoBERTa - Hugging Face
RoBERTa has the same architecture as BERT, but uses a byte-level BPE as a tokenizer (same as GPT-2) and uses a different pretraining...
transformers/tokenization_roberta.py at main · huggingface ...
Constructs a RoBERTa tokenizer, derived from the GPT-2 tokenizer, using byte-level Byte-Pair-Encoding. This tokenizer has been trained to treat spaces like ...
BERT WordPiece Tokenizer Tutorial | Towards Data Science
It works by splitting words either into the full forms (e.g., one word becomes one token) or into word pieces — where one...
A pipeline for large raw text preprocessing and model training ...
Regarding training, learning from scratch large models presents several challenges, ... To show the usefulness of the corpora generated with our approach, ...
Source code for hanlp.transform.transformer_tokenizer
_KEY] self.output_key = output_key if isinstance(tokenizer, ... boundary of tokens and tokenize each token into several subtokens then merge ...
