`XLMRobertaTokenizer` `encode_plus` API producing `<unk>` for a valid token
Environment info

- `transformers` version: 4.5.0.dev0 (latest master)
- Platform: Windows-10-10.0.19041-SP0
- Python version: 3.7.10
- PyTorch version (GPU?): not installed (NA)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
Who can help

Information

`XLMRobertaTokenizer` `encode_plus` API producing `<unk>` for a valid token
To reproduce
```python
from transformers import XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")

text = "请在黄鹂餐厅预订今晚7点半的位置。"

toks = tokenizer.tokenize(text)
assert toks == ['▁', '请', '在', '黄', '鹂', '餐厅', '预订', '今晚', '7', '点', '半', '的位置', '。']

output = tokenizer.encode_plus(text, add_special_tokens=False)
toks_converted = tokenizer.convert_ids_to_tokens(output['input_ids'])
assert toks_converted == ['▁', '请', '在', '黄', '<unk>', '餐厅', '预订', '今晚', '7', '点', '半', '的位置', '。']
```
Expected behavior

```python
assert toks_converted[4] == '鹂'  # not <unk>
```
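The mismatch appears to come from the token-to-id conversion rather than from `tokenize` itself, since the slow tokenizer's `encode_plus` tokenizes and then converts the resulting pieces to ids. A minimal sketch of checking that single step, assuming the same `xlm-roberta-base` checkpoint as above:

```python
from transformers import XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")

# `tokenize` produces the piece '鹂', but converting that single piece back
# to an id yields the unknown-token id (3 in this vocabulary, per the report).
print(tokenizer.convert_tokens_to_ids("鹂"))  # expected per this issue: 3
print(tokenizer.unk_token_id)                 # 3
```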
Top GitHub Comments
Hey guys,
for the sake of completeness, here’s the double check with the reference implementation/tokenizer:
It outputs:
3 is the id for the unknown token, but you can “reverse” the tokenization with:

This outputs:

So the `<unk>` token also appears 😃
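A minimal sketch of such a double check, assuming fairseq's XLM-R (`xlmr.base` loaded via `torch.hub`, as in fairseq's published examples) is the reference implementation meant here:

```python
import torch

# Requires fairseq and sentencepiece to be installed.
xlmr = torch.hub.load("pytorch/fairseq", "xlmr.base")
xlmr.eval()

ids = xlmr.encode("请在黄鹂餐厅预订今晚7点半的位置。")
print(ids)               # per the comment above, id 3 (<unk>) appears where '鹂' should be
print(xlmr.decode(ids))  # the decoded text contains '<unk>' as well
```

This is consistent with the observation above that the reference tokenizer also maps this character to the unknown token.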
hi guys. I tried to reproduce the code at the beginning of the topic and I get the following: