
New GPT2 tokenizer no longer encodes Unicode characters properly in Python 3

See original GitHub issue

In commit 5afa497cbfc53c679a9b22997b6312fad57ee2f8, you changed token.encode('utf-8') to simply token.

This makes the code compatible with Python 2, but it now breaks in Python 3: you get a KeyError whenever you try to encode a character whose code point is above 255, i.e. anything outside the keys of byte_encoder. For example, this raises a KeyError in Python 3:

from pytorch_pretrained_bert.tokenization_gpt2 import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.encode('你')
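To make the failure mode concrete, here is a minimal sketch; the byte_encoder below is a stand-in identity mapping rather than the actual table built by bytes_to_unicode(), but either way its keys are the byte values 0-255:

# Stand-in for the byte-to-unicode table: keys are the byte values 0-255.
byte_encoder = {b: chr(b) for b in range(256)}

token = '你'
ord(token[0])                                      # 20320 in Python 3 -> not a key -> KeyError
[byte_encoder[b] for b in token.encode('utf-8')]   # iterates the bytes 228, 189, 160 -> all valid keys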

I think what you want to do is:

if sys.version_info[0] == 2:
    token = ''.join(self.byte_encoder[ord(b)] for b in token)
else:
    token = ''.join(self.byte_encoder[b] for b in token.encode('utf-8'))
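Equivalently, a small version-agnostic helper along these lines keeps every lookup key in the 0-255 range (bytes_of is a hypothetical name, not something in the library, and the branch assumes import sys as in the snippet above):

import sys

def bytes_of(token):
    # Hypothetical helper: return the UTF-8 byte values of a token as ints in
    # 0-255, regardless of Python version, so byte_encoder lookups never fail.
    if sys.version_info[0] == 2:
        return [ord(b) for b in token]      # Python 2: token is already a UTF-8 byte string
    return list(token.encode('utf-8'))      # Python 3: encode first; iterating bytes yields ints

bytes_of('你')   # [228, 189, 160]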

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Reactions: 18
  • Comments: 7 (2 by maintainers)

Top GitHub Comments

2 reactions
jamestjw commented, Jun 6, 2019

I can confirm that this is happening, though it is a different dash.
1 reaction
lirongyuan commented, Jun 8, 2019

I can confirm that this is happening, though it is a different dash.

Same here: this is also happening while using the GPT2 tokenizer:

Traceback (most recent call last):
  File "run_lambada_gpt2.py", line 139, in tokenize_and_encode
    token_ids = tokenizer.encode(obj)
  File "/data/anaconda/envs/py35/lib/python3.5/site-packages/pytorch_pretrained_bert/tokenization_gpt2.py", line 261, in encode
    return self.convert_tokens_to_ids(self.tokenize(text))
  File "/data/anaconda/envs/py35/lib/python3.5/site-packages/pytorch_pretrained_bert/tokenization_gpt2.py", line 224, in tokenize
    token = ''.join(self.byte_encoder[ord(b)] for b in token)
  File "/data/anaconda/envs/py35/lib/python3.5/site-packages/pytorch_pretrained_bert/tokenization_gpt2.py", line 224, in <genexpr>
    token = ''.join(self.byte_encoder[ord(b)] for b in token)
KeyError: 8217

The sys version info is: sys.version_info(major=3, minor=5, micro=5, releaselevel='final', serial=0)
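For what it's worth, the 8217 in the KeyError is the Unicode code point of the offending character, not a byte value, so any character above the 0-255 range will trip the same lookup:

chr(8217)   # '\u2019', a right single quotation mark (the "curly" apostrophe)
hex(8217)   # '0x2019'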

Read more comments on GitHub.

Top Results From Across the Web

Summary of the tokenizers - Hugging Face
The Unigram algorithm always keeps the base characters so that any word can be tokenized. Because Unigram is not based on merge rules...

Understanding the GPT-2 Source Code Part 2 - Medium
Returns list of utf-8 byte and a corresponding list of unicode strings. The reversible bpe codes work on unicode strings.

NLG with GPT-2 - Jake Tae
HuggingFace Tokenizers: Using the tokenizer that we initialized earlier, let's try encoding a simple sentence. Since we will be using PyTorch, ...

How many characters can be input into the "prompt" for GPT-2
GPT-2 does not work on character-level but on the subword level. The maximum length of text segments it was trained on was 1,024...

Train GPT-2 in your own language - Towards Data Science
Some notes on the tokenization: We use BPE (Byte Pair Encoding), which is a sub word encoding; this generally takes care of not...
