`XLMRobertaTokenizer` `encode_plus` API producing `<unk>` for a valid token
Environment info

- `transformers` version: 4.5.0.dev0 (latest master)
- Platform: Windows-10-10.0.19041-SP0
- Python version: 3.7.10
- PyTorch version (GPU?): not installed (NA)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
Who can help

Information

`XLMRobertaTokenizer` `encode_plus` API producing `<unk>` for a valid token
To reproduce
```python
from transformers import XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")

text = "请在黄鹂餐厅预订今晚7点半的位置。"

toks = tokenizer.tokenize(text)
assert toks == ['▁', '请', '在', '黄', '鹂', '餐厅', '预订', '今晚', '7', '点', '半', '的位置', '。']

output = tokenizer.encode_plus(text, add_special_tokens=False)
toks_converted = tokenizer.convert_ids_to_tokens(output['input_ids'])
assert toks_converted == ['▁', '请', '在', '黄', '<unk>', '餐厅', '预订', '今晚', '7', '点', '半', '的位置', '。']
```
Expected behavior

```python
assert toks_converted[4] == '鹂'  # not <unk>
```
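The mismatch appears to come from the token-to-id conversion rather than from `tokenize` itself, since the slow tokenizer's `encode_plus` tokenizes and then converts the resulting pieces to ids. A minimal sketch of checking that single step, assuming the same `xlm-roberta-base` checkpoint as above:

```python
from transformers import XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")

# `tokenize` produces the piece '鹂', but converting that single piece back
# to an id yields the unknown-token id (3 in this vocabulary, per the report).
print(tokenizer.convert_tokens_to_ids("鹂"))  # expected per this issue: 3
print(tokenizer.unk_token_id)                 # 3
```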
Top GitHub Comments
Hey guys,
for the sake of completeness, here’s the double check with the reference implementation/tokenizer:
It outputs:
3 is the id for the unknown token, but you can “reverse” the tokenization with:

This outputs:

So the `<unk>` token also appears 😃
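A minimal sketch of such a double check, assuming fairseq's XLM-R (`xlmr.base` loaded via `torch.hub`, as in fairseq's published examples) is the reference implementation meant here:

```python
import torch

# Requires fairseq and sentencepiece to be installed.
xlmr = torch.hub.load("pytorch/fairseq", "xlmr.base")
xlmr.eval()

ids = xlmr.encode("请在黄鹂餐厅预订今晚7点半的位置。")
print(ids)               # per the comment above, id 3 (<unk>) appears where '鹂' should be
print(xlmr.decode(ids))  # the decoded text contains '<unk>' as well
```

This is consistent with the observation above that the reference tokenizer also maps this character to the unknown token.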
hi guys. I tried to reproduce the code at the beginning of the topic and I get the following: