New GPT2 tokenizer no longer encodes Unicode characters properly in Python 3
In commit 5afa497cbfc53c679a9b22997b6312fad57ee2f8, you changed token.encode('utf-8') to simply token. This makes the code compatible with Python 2, but it now breaks in Python 3: you get a KeyError when you try to encode any Unicode character that requires more than one byte in UTF-8. For example, this raises a KeyError in Python 3:
from pytorch_pretrained_bert.tokenization_gpt2 import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.encode('你')
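Under the hood this is a Python 3 string-iteration issue: iterating over a str yields one-character strings, so ord(b) in the tokenizer returns a Unicode code point rather than a byte value, and anything above 255 misses the byte_encoder table. A minimal standalone demonstration (no library involved):

token = '你'
print([ord(ch) for ch in token])      # [20320] -- a code point, not in byte_encoder (keys 0..255)
print(list(token.encode('utf-8')))    # [228, 189, 160] -- the UTF-8 bytes the table expects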
I think what you want to do is:
if sys.version_info[0] == 2:
    token = ''.join(self.byte_encoder[ord(b)] for b in token)
else:
    token = ''.join(self.byte_encoder[b] for b in token.encode('utf-8'))
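For context, the KeyError comes from self.byte_encoder, a dict whose keys are the 256 possible byte values. It is built roughly like the bytes_to_unicode helper in the GPT-2 encoder; the sketch below is paraphrased from that helper, so treat it as illustrative rather than the exact library source:

def bytes_to_unicode():
    # Map every byte value 0..255 to a printable unicode character.
    # Printable Latin-1 bytes map to themselves; the remaining bytes are
    # shifted up past 255 so the BPE vocab never contains raw control bytes.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))

byte_encoder = bytes_to_unicode()
print(len(byte_encoder))  # 256 -- any key outside 0..255 (e.g. ord('你') == 20320) raises KeyError

Since the keys only cover 0..255, the lookup has to be fed bytes (via token.encode('utf-8') in Python 3), never code points.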
Same here: this is also happening with the GPT2 tokenizer:

Traceback (most recent call last):
  File "run_lambada_gpt2.py", line 139, in tokenize_and_encode
    token_ids = tokenizer.encode(obj)
  File "/data/anaconda/envs/py35/lib/python3.5/site-packages/pytorch_pretrained_bert/tokenization_gpt2.py", line 261, in encode
    return self.convert_tokens_to_ids(self.tokenize(text))
  File "/data/anaconda/envs/py35/lib/python3.5/site-packages/pytorch_pretrained_bert/tokenization_gpt2.py", line 224, in tokenize
    token = ''.join(self.byte_encoder[ord(b)] for b in token)
  File "/data/anaconda/envs/py35/lib/python3.5/site-packages/pytorch_pretrained_bert/tokenization_gpt2.py", line 224, in <genexpr>
    token = ''.join(self.byte_encoder[ord(b)] for b in token)
KeyError: 8217
The sys version info is:
sys.version_info(major=3, minor=5, micro=5, releaselevel='final', serial=0)
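For what it's worth, 8217 is U+2019 (RIGHT SINGLE QUOTATION MARK), a curly apostrophe that presumably appears in the LAMBADA text given the script name, so this is the same root cause: ord() is being applied to a character instead of a byte. A quick check:

ch = '\u2019'                      # the '’' character behind KeyError: 8217
print(ord(ch))                     # 8217 -- outside the 0..255 byte_encoder keys
print(list(ch.encode('utf-8')))    # [226, 128, 153] -- valid byte values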