`XLMRobertaTokenizer` `encode_plus` api producing `<unk>` for a valid token

Environment info

  • transformers version: 4.5.0.dev0 (latest master)
  • Platform: Windows-10-10.0.19041-SP0
  • Python version: 3.7.10
  • PyTorch version (GPU?): not installed (NA)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help

@LysandreJik

Information

The `encode_plus` API of `XLMRobertaTokenizer` produces `<unk>` for a valid token.

To reproduce

from transformers import XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")

text = "请在黄鹂餐厅预订今晚7点半的位置。"
# tokenize() returns the string pieces; '鹂' is still present at index 4.
toks = tokenizer.tokenize(text)
assert toks == ['▁', '请', '在', '黄', '鹂', '餐厅', '预订', '今晚', '7', '点', '半', '的位置', '。']

# encode_plus() returns ids; converting them back to tokens yields '<unk>' at index 4.
output = tokenizer.encode_plus(text, add_special_tokens=False)
toks_converted = tokenizer.convert_ids_to_tokens(output['input_ids'])

assert toks_converted == ['▁', '请', '在', '黄', '<unk>', '餐厅', '预订', '今晚', '7', '点', '半', '的位置', '。']

Expected behavior

assert toks_converted[4] == '鹂'  # not <unk>
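
A minimal sketch of how the mismatch can arise, assuming (this is not confirmed in the thread) that `tokenize` reports the surface text of every piece while the id lookup falls back to the unknown id for pieces missing from the vocabulary. `TOY_VOCAB`, `pieces_to_ids` and `find_unknown_pieces` are hypothetical names for illustration, not part of transformers. The ids are taken from the fairseq output quoted in the comments and aligned by position with the pieces above.

```python
# Toy stand-in for the piece -> id table; NOT the real XLM-R vocabulary.
# Ids are copied from the fairseq tensor quoted in this thread, aligned by
# position with the pieces from the reproduction script (an inference).
UNK_ID = 3  # the thread states that 3 is the id of the unknown token

TOY_VOCAB = {
    "▁": 6, "请": 9736, "在": 213, "黄": 19390,
    "餐厅": 113638, "预订": 209093, "今晚": 155755,
    "7": 966, "点": 2391, "半": 6193, "的位置": 57486, "。": 30,
}


def pieces_to_ids(pieces, vocab=TOY_VOCAB, unk_id=UNK_ID):
    """Map each piece to its id, falling back to unk_id when it is absent."""
    return [vocab.get(piece, unk_id) for piece in pieces]


def find_unknown_pieces(pieces, vocab=TOY_VOCAB, unk_id=UNK_ID):
    """Return (index, piece) pairs for pieces that would be encoded as unk_id."""
    return [
        (index, piece)
        for index, piece in enumerate(pieces)
        if vocab.get(piece, unk_id) == unk_id
    ]


pieces = ['▁', '请', '在', '黄', '鹂', '餐厅', '预订', '今晚', '7', '点', '半', '的位置', '。']
ids = pieces_to_ids(pieces)
unknown = find_unknown_pieces(pieces)

print(ids)
print(unknown)
```

In this toy model the string pieces still contain '鹂', but its id is the unknown id, so any conversion from ids back to tokens can only show `<unk>` at that position.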

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 7 (2 by maintainers)

Top GitHub Comments

1 reaction
stefan-it commented, Mar 31, 2021

Hey guys,

for the sake of completeness, here’s a double check against the reference implementation and its tokenizer:

import torch
xlmr = torch.hub.load('pytorch/fairseq', 'xlmr.base')
xlmr.eval()

tokens = xlmr.encode('请在黄鹂餐厅预订今晚7点半的位置。')

It outputs:

tensor([     0,      6,   9736,    213,  19390,      3, 113638, 209093, 155755,
           966,   2391,   6193,  57486,     30,      2])

3 is the id of the unknown token. You can “reverse” the tokenization with:

xlmr.decode(tokens)

This outputs:

'请在黄<unk>餐厅预订今晚7点半的位置。'

So the <unk> token also appears 😃
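
The decode step above can be imitated with a toy id-to-piece table, as a rough sketch rather than the fairseq implementation. The table holds only the ids from the quoted tensor, aligned by position with the pieces in the original report. Treating ids 0 and 2 as sentence markers to skip is an assumption based on their position at the start and end of the tensor. `TOY_ID_TO_PIECE` and `toy_decode` are hypothetical names.

```python
# Toy id -> piece table built from the tensor quoted above; NOT the real
# XLM-R vocabulary. The alignment with pieces is inferred by position.
TOY_ID_TO_PIECE = {
    3: "<unk>",
    6: "▁", 9736: "请", 213: "在", 19390: "黄",
    113638: "餐厅", 209093: "预订", 155755: "今晚",
    966: "7", 2391: "点", 6193: "半", 57486: "的位置", 30: "。",
}

# Assumption: 0 and 2 are the begin/end-of-sentence ids, since they sit at
# the first and last positions of the quoted tensor.
SPECIAL_IDS = {0, 2}


def toy_decode(ids, id_to_piece=TOY_ID_TO_PIECE, special_ids=SPECIAL_IDS):
    """Join pieces for the given ids, skipping special ids.

    The '▁' marker is turned back into a space and outer whitespace is
    stripped, which roughly mirrors how SentencePiece-style pieces are
    detokenized.
    """
    pieces = [
        id_to_piece.get(token_id, "<unk>")
        for token_id in ids
        if token_id not in special_ids
    ]
    return "".join(pieces).replace("▁", " ").strip()


token_ids = [0, 6, 9736, 213, 19390, 3, 113638, 209093, 155755,
             966, 2391, 6193, 57486, 30, 2]
decoded = toy_decode(token_ids)
print(decoded)
```

Because id 3 carries no information about the original character, the decoded string can only contain the literal `<unk>` where '鹂' used to be.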

0 reactions
hapazv commented, Apr 23, 2021

Hi guys.

I tried to reproduce the code from the beginning of this thread and got the following: [image: token roberta]
