
New GPT2 tokenizer no longer encodes Unicode characters properly in Python 3

See original GitHub issue

In commit 5afa497cbfc53c679a9b22997b6312fad57ee2f8, you changed token.encode('utf-8') to simply token.

This makes the code compatible with Python 2, but it now breaks in Python 3: you get a KeyError whenever you try to encode a character whose code point is above 255, i.e. anything outside the keys of byte_encoder. For example, this raises a KeyError in Python 3:

from pytorch_pretrained_bert.tokenization_gpt2 import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.encode('你')
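To make the failure mode concrete, here is a minimal sketch; the byte_encoder below is a stand-in identity mapping rather than the actual table built by bytes_to_unicode(), but either way its keys are the byte values 0-255:

# Stand-in for the byte-to-unicode table: keys are the byte values 0-255.
byte_encoder = {b: chr(b) for b in range(256)}

token = '你'
ord(token[0])                                      # 20320 in Python 3 -> not a key -> KeyError
[byte_encoder[b] for b in token.encode('utf-8')]   # iterates the bytes 228, 189, 160 -> all valid keys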

I think what you want to do is:

if sys.version_info[0] == 2:
    token = ''.join(self.byte_encoder[ord(b)] for b in token)
else:
    token = ''.join(self.byte_encoder[b] for b in token.encode('utf-8'))
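Equivalently, a small version-agnostic helper along these lines keeps every lookup key in the 0-255 range (bytes_of is a hypothetical name, not something in the library, and the branch assumes import sys as in the snippet above):

import sys

def bytes_of(token):
    # Hypothetical helper: return the UTF-8 byte values of a token as ints in
    # 0-255, regardless of Python version, so byte_encoder lookups never fail.
    if sys.version_info[0] == 2:
        return [ord(b) for b in token]      # Python 2: token is already a UTF-8 byte string
    return list(token.encode('utf-8'))      # Python 3: encode first; iterating bytes yields ints

bytes_of('你')   # [228, 189, 160]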

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Reactions: 18
  • Comments: 7 (2 by maintainers)

Top GitHub Comments

2 reactions
jamestjw commented, Jun 6, 2019

I can confirm that this is happening, though it is a different dash.
1 reaction
lirongyuan commented, Jun 8, 2019

I can confirm that this is happening, though it is a different dash.

Same here: this is also happening while using the GPT2 tokenizer:

Traceback (most recent call last):
  File "run_lambada_gpt2.py", line 139, in tokenize_and_encode
    token_ids = tokenizer.encode(obj)
  File "/data/anaconda/envs/py35/lib/python3.5/site-packages/pytorch_pretrained_bert/tokenization_gpt2.py", line 261, in encode
    return self.convert_tokens_to_ids(self.tokenize(text))
  File "/data/anaconda/envs/py35/lib/python3.5/site-packages/pytorch_pretrained_bert/tokenization_gpt2.py", line 224, in tokenize
    token = ''.join(self.byte_encoder[ord(b)] for b in token)
  File "/data/anaconda/envs/py35/lib/python3.5/site-packages/pytorch_pretrained_bert/tokenization_gpt2.py", line 224, in <genexpr>
    token = ''.join(self.byte_encoder[ord(b)] for b in token)
KeyError: 8217

The sys version info is: sys.version_info(major=3, minor=5, micro=5, releaselevel='final', serial=0)
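For what it's worth, the 8217 in the KeyError is the Unicode code point of the offending character, not a byte value, so any character above the 0-255 range will trip the same lookup:

chr(8217)   # '\u2019', a right single quotation mark (the "curly" apostrophe)
hex(8217)   # '0x2019'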

Read more comments on GitHub.

Top Results From Across the Web

Summary of the tokenizers - Hugging Face
The Unigram algorithm always keeps the base characters so that any word can be tokenized. Because Unigram is not based on merge rules...

Understanding the GPT-2 Source Code Part 2 - Medium
Returns list of utf-8 byte and a corresponding list of unicode strings. The reversible bpe codes work on unicode strings.

NLG with GPT-2 - Jake Tae
HuggingFace Tokenizers: Using the tokenizer that we initialized earlier, let's try encoding a simple sentence. Since we will be using PyTorch, ...

How many characters can be input into the "prompt" for GPT-2
GPT-2 does not work on character-level but on the subword level. The maximum length of text segments it was trained on was 1,024...

Train GPT-2 in your own language - Towards Data Science
Some notes on the tokenization: We use BPE (Byte Pair Encoding), which is a sub word encoding; this generally takes care of not...
