RoBERTa-large's BPE tokenizer generates multiple tokens for a verbalizer.
RoBERTa-large uses byte-level Byte-Pair-Encoding, which breaks the standard PET training setup (verbalizers are expected to map to a single token).
For example, the verbalization "Society" does not correspond to a single token; it is tokenized as ['Soc', 'iety'].
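A minimal reproduction of this behavior, assuming the stock RobertaTokenizer from the transformers library:

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-large")
# Without a leading space there is no single vocabulary entry for "Society",
# so the byte-level BPE splits it into sub-tokens.
print(tokenizer.tokenize("Society"))  # ['Soc', 'iety']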
For now I have commented out the assertion assert len(ids) == 1 in utils.py so that only the first token is used.
But I don't know whether this will affect accuracy. Is there an alternative, since PET uses RoBERTa-large by default?
Thanks~
Top GitHub Comments
Hi,
GPT-2 and Roberta tokenizers will recognize the space before a word and replace it with a “Ġ”. Actually, “Society” is not a token in the vocab but “ĠSociety” is a valid one. You can call
tokenizer.convert_tokens_to_ids("ĠSociety")
and the result is 3930.
The only thing you need to do is replace "tokenizer.encode(xxxxx)" with the following lines:
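A minimal sketch of that replacement, assuming the verbalizer string is held in a hypothetical variable word: prepend a space so the byte-level BPE produces the "Ġ"-prefixed vocabulary entry, then convert it to a single id.

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-large")

word = "Society"  # hypothetical verbalizer used for illustration
# Prepending a space makes the byte-level BPE emit the "Ġ"-prefixed token.
tokens = tokenizer.tokenize(" " + word)        # expected: ['ĠSociety']
ids = tokenizer.convert_tokens_to_ids(tokens)  # a single id if "ĠSociety" is in the vocab
assert len(ids) == 1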
Refer to this thread for more details: https://discuss.huggingface.co/t/bpe-tokenizers-and-spaces-before-words/475?u=joaogante Best.
Hi, how can I train a PET model using xlm-roberta with byte-level Byte-Pair-Encoding?