
Possible bug in spm-based tokenizers

See original GitHub issue

Environment info

  • transformers version: latest (4.10.0.dev0)
  • Python version: 3.8
  • PyTorch version (GPU?): 1.9.0
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help

@patrickvonplaten, @patil-suraj

Information

Model I am using (Bert, XLNet …): mbart-large-50-many-to-many-mmt

To reproduce

Running the following script shows that encoding and then decoding a Chinese string does not give back the original string (the punctuation marks get normalized):

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('facebook/mbart-large-50-many-to-many-mmt', src_lang='zh_CN', tgt_lang='zh_CN')

sentence = '您好，您打算到哪里去呢？'
input = tokenizer(sentence)
output = tokenizer.decode(input['input_ids'], skip_special_tokens=True)

print(output)
print(output == sentence)

stdout:

您好,您打算到哪里去呢?
False
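
As a side note (not part of the original report), the mismatch is consistent with NFKC-style normalization of the full-width punctuation by the underlying SentencePiece model. A minimal standalone check using only Python's unicodedata, assuming the decoded string is the half-width variant shown above:

import unicodedata

original = '您好，您打算到哪里去呢？'   # full-width comma and question mark
decoded = '您好,您打算到哪里去呢?'      # half-width, as returned by the round trip above

print(original == decoded)                                  # False
print(unicodedata.normalize('NFKC', original) == decoded)   # True: only the punctuation was folded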

Using the slow version of the tokenizer, or setting the src_lang and tgt_lang attributes directly instead of passing them to from_pretrained, gives the same result (see the sketch after this paragraph).
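
A sketch of those two variants with the same checkpoint (my reading of the sentence above, not code from the original issue); according to the report, both produce the same normalized output as the fast tokenizer:

from transformers import AutoTokenizer

sentence = '您好，您打算到哪里去呢？'

# Variant 1: the slow (pure-Python) tokenizer
slow_tok = AutoTokenizer.from_pretrained(
    'facebook/mbart-large-50-many-to-many-mmt',
    use_fast=False, src_lang='zh_CN', tgt_lang='zh_CN',
)
print(slow_tok.decode(slow_tok(sentence)['input_ids'], skip_special_tokens=True))

# Variant 2: setting src_lang / tgt_lang as attributes after loading
tok = AutoTokenizer.from_pretrained('facebook/mbart-large-50-many-to-many-mmt')
tok.src_lang = 'zh_CN'
tok.tgt_lang = 'zh_CN'
print(tok.decode(tok(sentence)['input_ids'], skip_special_tokens=True))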

Expected behavior

Expected stdout:

您好，您打算到哪里去呢？
True

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 9 (6 by maintainers)

Top GitHub Comments

2 reactions
patil-suraj commented, Sep 14, 2021

Hi @Mehrad0711, sorry to only reply now.

I will try to allocate some time this week for it.

0 reactions
github-actions[bot] commented, Jan 7, 2022

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.


