KeyErrors when running multiprocessing_bpe_encoder.py

Hi all,

I have created my own BPE vocab with the tokenizers library following the steps described here.

I am now trying to encode my corpus (made of Brazilian tweets) using the multiprocessing_bpe_encoder.py script. The script works fine for a while and then crashes with a KeyError:

processed 260000 lines
processed 270000 lines
processed 280000 lines
processed 290000 lines
processed 300000 lines
processed 310000 lines
processed 320000 lines
processed 330000 lines
processed 340000 lines
processed 350000 lines
processed 360000 lines
processed 370000 lines
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/share/apps/anaconda3/2019.10/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/share/apps/anaconda3/2019.10/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/scratch/mt4493/twitter_labor/code/pretraining/preprocessing/roberta/multiprocessing_bpe_encoder.py", line 117, in encode_lines
    tokens = self.encode(line)
  File "/scratch/mt4493/twitter_labor/code/pretraining/preprocessing/roberta/multiprocessing_bpe_encoder.py", line 101, in encode
    ids = bpe.encode(line)
  File "/scratch/mt4493/twitter_labor/code/envs/inference_env/lib/python3.7/site-packages/fairseq/data/encoders/gpt2_bpe_utils.py", line 119, in encode
    self.encoder[bpe_token] for bpe_token in self.bpe(token).split(" ")
  File "/scratch/mt4493/twitter_labor/code/envs/inference_env/lib/python3.7/site-packages/fairseq/data/encoders/gpt2_bpe_utils.py", line 119, in <genexpr>
    self.encoder[bpe_token] for bpe_token in self.bpe(token).split(" ")
KeyError: 'Ğ'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/scratch/mt4493/twitter_labor/code/pretraining/preprocessing/roberta/multiprocessing_bpe_encoder.py", line 130, in <module>
    main()
  File "/scratch/mt4493/twitter_labor/code/pretraining/preprocessing/roberta/multiprocessing_bpe_encoder.py", line 78, in main
    for i, (filt, enc_lines) in enumerate(encoded_lines, start=1):
  File "/share/apps/anaconda3/2019.10/lib/python3.7/multiprocessing/pool.py", line 325, in <genexpr>
    return (item for chunk in result for item in chunk)
  File "/share/apps/anaconda3/2019.10/lib/python3.7/multiprocessing/pool.py", line 748, in next
    raise value
KeyError: 'Ğ'

On other runs I get a different one: KeyError: 'Ě'.

Neither my corpus nor my encoder.json and vocab.bpe files contain these characters, and I'm not sure what the problem is.
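
One way to see where such characters can come from: the traceback goes through gpt2_bpe_utils.py, and GPT-2 style byte-level BPE rewrites every raw byte as a printable stand-in character before looking it up in the encoder. The sketch below rebuilds that byte table from scratch (it is a reimplementation for illustration, not an import from fairseq) and asks which raw byte each failing symbol stands for. The helper names `raw_byte_for` and `missing_byte_symbols` are hypothetical, and whether this is the actual cause in this issue is an assumption, since the thread itself does not confirm it.

```python
def bytes_to_unicode():
    """Rebuild the byte -> printable character table used by GPT-2 style byte-level BPE."""
    # Bytes that are already printable keep their own code point.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    # Every other byte (control characters, space, etc.) is shifted to 256 and above.
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def raw_byte_for(symbol):
    """Hypothetical helper: return the raw byte that a stand-in character represents."""
    return CHAR_TO_BYTE[symbol]


def missing_byte_symbols(encoder):
    """Hypothetical helper: list byte-level symbols absent from an encoder.json-style dict."""
    return sorted(c for c in BYTE_TO_CHAR.values() if c not in encoder)


if __name__ == "__main__":
    for symbol in ("Ğ", "Ě"):
        print(repr(symbol), "stands for raw byte", hex(raw_byte_for(symbol)))
```

Under this table, 'Ğ' and 'Ě' stand for the bytes 0x1e and 0x1a, which are control characters rather than visible letters. That would be consistent with the characters never showing up when reading the corpus or the vocab files. Loading encoder.json with `json.load` and passing the resulting dict to `missing_byte_symbols` would show whether the vocab lacks entries for some single-byte symbols.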

Thanks a lot in advance for the help.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 11 (6 by maintainers)

Top GitHub Comments

1 reaction
manueltonneau commented, Nov 18, 2020

Cool. Will recreate my vocab and close this as soon as it is fixed on my side. Thanks a lot for the help 😃

1 reaction
lematt1991 commented, Nov 18, 2020
python -c "import tokenizers; print(tokenizers.__version__)"
0.9.4