KeyErrors when running multiprocessing_bpe_encoder.py
Hi all,
I have created my own BPE vocab with the tokenizers library following the steps described here.
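For reference, the vocab was built roughly as follows (a sketch: the corpus path, vocab size and output directory are placeholders rather than my exact settings, and save_model may be plain save on older tokenizers releases):

from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer on the raw tweet corpus
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["tweets.txt"],  # placeholder path to the raw corpus
    vocab_size=50000,      # placeholder size
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Writes vocab.json and merges.txt into the given directory; these are the
# files I then pass to fairseq as --encoder-json and --vocab-bpe.
tokenizer.save_model(".")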
I am now trying to encode my corpus (made of Brazilian tweets) using the multiprocessing_bpe_encoder.py script. When doing so, the script works fine for a while and then crashes due to some KeyErrors:
processed 260000 lines
processed 270000 lines
processed 280000 lines
processed 290000 lines
processed 300000 lines
processed 310000 lines
processed 320000 lines
processed 330000 lines
processed 340000 lines
processed 350000 lines
processed 360000 lines
processed 370000 lines
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/share/apps/anaconda3/2019.10/lib/python3.7/multiprocessing/pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "/share/apps/anaconda3/2019.10/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
File "/scratch/mt4493/twitter_labor/code/pretraining/preprocessing/roberta/multiprocessing_bpe_encoder.py", line 117, in encode_lines
tokens = self.encode(line)
File "/scratch/mt4493/twitter_labor/code/pretraining/preprocessing/roberta/multiprocessing_bpe_encoder.py", line 101, in encode
ids = bpe.encode(line)
File "/scratch/mt4493/twitter_labor/code/envs/inference_env/lib/python3.7/site-packages/fairseq/data/encoders/gpt2_bpe_utils.py", line 119, in encode
self.encoder[bpe_token] for bpe_token in self.bpe(token).split(" ")
File "/scratch/mt4493/twitter_labor/code/envs/inference_env/lib/python3.7/site-packages/fairseq/data/encoders/gpt2_bpe_utils.py", line 119, in <genexpr>
self.encoder[bpe_token] for bpe_token in self.bpe(token).split(" ")
KeyError: 'Ğ'
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/scratch/mt4493/twitter_labor/code/pretraining/preprocessing/roberta/multiprocessing_bpe_encoder.py", line 130, in <module>
main()
File "/scratch/mt4493/twitter_labor/code/pretraining/preprocessing/roberta/multiprocessing_bpe_encoder.py", line 78, in main
for i, (filt, enc_lines) in enumerate(encoded_lines, start=1):
File "/share/apps/anaconda3/2019.10/lib/python3.7/multiprocessing/pool.py", line 325, in <genexpr>
return (item for chunk in result for item in chunk)
File "/share/apps/anaconda3/2019.10/lib/python3.7/multiprocessing/pool.py", line 748, in next
raise value
KeyError: 'Ğ'
I also get another error further along in the corpus: KeyError: 'Ě'.
Neither my corpus nor my encoder.json and vocab.bpe files contain these characters, so I'm not sure what the problem is.
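In case it is useful, here is the kind of quick check I would run to see which byte-level symbols are missing from encoder.json (a sketch that uses the bytes_to_unicode helper from the gpt2_bpe_utils module shown in the traceback; the encoder.json path is a placeholder):

import json

from fairseq.data.encoders.gpt2_bpe_utils import bytes_to_unicode

# encoder.json maps byte-level BPE tokens to integer ids
with open("encoder.json", "r", encoding="utf-8") as f:
    encoder = json.load(f)

# bytes_to_unicode() maps every byte value (0-255) to the printable unicode
# character the GPT-2-style BPE works with internally ('Ğ', 'Ě', 'Ġ', ...).
byte_symbols = set(bytes_to_unicode().values())

# Any symbol printed here is a byte-level character the encoder cannot look
# up, which is exactly the situation that raises the KeyError above.
missing = sorted(byte_symbols - set(encoder))
print(missing)

If the list is non-empty, the custom vocab is missing some of the base byte symbols, which would match the errors above.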
Thanks a lot in advance for the help.
Top GitHub Comments
Cool. Will recreate my vocab and close this as soon as it is fixed on my side. Thanks a lot for the help 😃