
Roberta-large using a BPE tokenizer generates multiple tokens.

See original GitHub issue

Roberta-large uses byte-level Byte-Pair-Encoding, which breaks the usual PET training setup, where each verbalization must map to a single token.

For example, the verbalization "Society" does not correspond to a single token; the tokenizer returns ['Soc', 'iety'].
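
For reference, here is a minimal sketch that reproduces this with the Hugging Face transformers tokenizer; the expected outputs in the comments follow the example above and the reply further down:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large")

# Without a leading space, the word is split into several byte-level BPE pieces.
print(tokenizer.tokenize("Society"))                 # ['Soc', 'iety']

# With a leading space it maps to the single word-initial token "ĠSociety".
print(tokenizer.tokenize(" Society"))                # ['ĠSociety']
print(tokenizer.convert_tokens_to_ids("ĠSociety"))   # 3930, per the reply below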

For now I have just commented out the assertion assert len(ids) == 1 in utils.py so that the first token id is used.

But I don’t know whether this will affect accuracy. Is there an alternative, since PET uses Roberta-large by default?

Thanks~

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 6

Top GitHub Comments

2 reactions
huchinlp commented, Oct 15, 2022

Hi,

GPT-2 and Roberta tokenizers will recognize the space before a word and replace it with a “Ġ”. Actually, “Society” is not a token in the vocab but “ĠSociety” is a valid one. You can call tokenizer.convert_tokens_to_ids("ĠSociety") and the result is 3930.

The only thing you need to do is replace “tokenizer.encode(xxxxx)” with the following lines:

# Fall back to the word-initial form prefixed with "Ġ" (the byte-level BPE
# marker for a preceding space) when the bare word is not in the vocabulary.
if tokenizer.convert_tokens_to_ids(word) == tokenizer.unk_token_id:
    space_word = "Ġ" + word
    token_id = tokenizer.convert_tokens_to_ids(space_word)
else:
    token_id = tokenizer.convert_tokens_to_ids(word)

Refer to this thread for more details: https://discuss.huggingface.co/t/bpe-tokenizers-and-spaces-before-words/475?u=joaogante

Best.
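
As a quick usage sketch (the helper name is illustrative, not something PET defines), the fallback above can be wrapped like this:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large")

# Hypothetical helper: map a verbalization to a single vocab id, falling back to
# the "Ġ"-prefixed word-initial form when the bare form is not in the vocabulary.
def verbalizer_token_id(tokenizer, word):
    token_id = tokenizer.convert_tokens_to_ids(word)
    if token_id == tokenizer.unk_token_id:
        token_id = tokenizer.convert_tokens_to_ids("Ġ" + word)
    return token_id

print(verbalizer_token_id(tokenizer, "Society"))  # 3930, per the comment above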

0 reactions
nieallen commented, Nov 29, 2022

Hi, how do I train a PET model with xlm-roberta, which uses byte-level Byte-Pair-Encoding?
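
The thread does not answer this, but as a hedged sketch: XLM-RoBERTa ships a SentencePiece vocabulary in which word-initial pieces are marked with "▁" rather than "Ġ", so the same fallback idea can be tried; whether a given verbalization ends up as a single piece still has to be checked per word.

from transformers import AutoTokenizer

xlmr_tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")

# Same idea as the RoBERTa fix above, but with SentencePiece's "▁" space marker.
# Any word that still maps to unk_token_id is not a single piece and would need
# to be replaced or handled differently.
def single_piece_id(tokenizer, word, space_marker="▁"):
    token_id = tokenizer.convert_tokens_to_ids(word)
    if token_id == tokenizer.unk_token_id:
        token_id = tokenizer.convert_tokens_to_ids(space_marker + word)
    return token_id

print(single_piece_id(xlmr_tokenizer, "Society"))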


Top Results From Across the Web

RoBERTa - Hugging Face
RoBERTa has the same architecture as BERT, but uses a byte-level BPE as a tokenizer (same as GPT-2) and uses a different pretraining...
transformers/tokenization_roberta.py at main · huggingface ...
Constructs a RoBERTa tokenizer, derived from the GPT-2 tokenizer, using byte-level Byte-Pair-Encoding. This tokenizer has been trained to treat spaces like ...
BERT WordPiece Tokenizer Tutorial | Towards Data Science
It works by splitting words either into the full forms (e.g., one word becomes one token) or into word pieces — where one...
A pipeline for large raw text preprocessing and model training ...
Regarding training, learning from scratch large models presents several challenges, ... To show the usefulness of the corpora generated with our approach, ...
Source code for hanlp.transform.transformer_tokenizer
_KEY] self.output_key = output_key if isinstance(tokenizer, ... boundary of tokens and tokenize each token into several subtokens then merge ...
