Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Multiprocessing Errors when Training for Multiple Epochs

See original GitHub issue

Bug Description

When training RoBERTa for a text regression task for more than 1 epoch, I get an OSError and ValueError related to the multiprocessing package (see details below).

Reproduction Info

The error persists even when using the exact same code as presented in the simpletransformers documentation (roughly the setup sketched below), except for model_args.num_train_epochs = 4 (and other values > 1). I also tried fixing model_args.process_count and then setting model_args.use_multiprocessing = False, neither of which helped. The code runs without errors when model_args.num_train_epochs = 1.
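For reference, a minimal sketch of that kind of setup, assuming the standard ClassificationModel regression recipe from the simpletransformers docs (the checkpoint, data, and column names below are placeholders, not the reporter's actual code):

import pandas as pd
from simpletransformers.classification import ClassificationArgs, ClassificationModel

# Placeholder training data: one text column and a continuous target (text regression).
train_df = pd.DataFrame(
    {"text": ["first example", "second example"], "labels": [0.8, 0.3]}
)

model_args = ClassificationArgs()
model_args.num_train_epochs = 4   # training for more than 1 epoch triggers the reported error
model_args.regression = True

# Regression in simpletransformers uses ClassificationModel with num_labels=1.
model = ClassificationModel("roberta", "roberta-base", num_labels=1, args=model_args)
model.train_model(train_df)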

Setup Details

  • Debian Linux 10
  • GTX 1080 Ti / RTX 2080 Ti with CUDA 11
  • Python 3.7, PyTorch 1.7.0, simpletransformers 0.48.14.

Full Error Message

Exception in thread QueueFeederThread:
Traceback (most recent call last):
  File ".../anaconda3/lib/python3.7/multiprocessing/queues.py", line 232, in _feed
    close()
  File ".../anaconda3/lib/python3.7/multiprocessing/connection.py", line 177, in close
    self._close()
  File ".../anaconda3/lib/python3.7/multiprocessing/connection.py", line 361, in _close
    _close(self._handle)
OSError: [Errno 9] Bad file descriptor

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File ".../anaconda3/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File ".../anaconda3/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File ".../anaconda3/lib/python3.7/multiprocessing/queues.py", line 263, in _feed
    queue_sem.release()
ValueError: semaphore or lock released too many times

Edit: Added GPU, PyTorch, and CUDA specs.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 15 (4 by maintainers)

Top GitHub Comments

1 reaction
ThilinaRajapakse commented on Nov 26, 2020

Setting "process_count": 1, "use_multiprocessing": False, and "dataloader_num_workers": 1 did the trick. Did you try this?
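Applied to the reporter's model_args, that workaround would look roughly like this (a sketch under the same placeholder assumptions as the earlier snippet, not the maintainer's exact code):

from simpletransformers.classification import ClassificationArgs, ClassificationModel

model_args = ClassificationArgs()
model_args.num_train_epochs = 4
model_args.regression = True

# Suggested workaround: disable multiprocessing for feature conversion and data loading.
model_args.process_count = 1
model_args.use_multiprocessing = False
model_args.dataloader_num_workers = 1

model = ClassificationModel("roberta", "roberta-base", num_labels=1, args=model_args)
model.train_model(train_df)   # train_df as in the earlier placeholder sketch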

0 reactions
stale[bot] commented on Feb 21, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Read more comments on GitHub >

Top Results From Across the Web

python - How can I take advantage of multiprocessing and ...
Here's how you can use multiprocessing to train multiple models at the same time (using processes running in parallel on each separate CPU...
Read more >
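The linked answer is only excerpted above; a generic sketch of the pattern it describes, training independent models in separate worker processes, could look like this (the models and data are stand-ins, not taken from that answer):

from multiprocessing import Pool

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Stand-in dataset shared by all workers.
X, y = make_regression(n_samples=200, n_features=10, random_state=0)

def fit_model(model):
    # Each worker process fits one model independently.
    model.fit(X, y)
    return model

if __name__ == "__main__":
    models = [LinearRegression(), RandomForestRegressor(n_estimators=50, random_state=0)]
    with Pool(processes=len(models)) as pool:
        trained = pool.map(fit_model, models)
    print([type(m).__name__ for m in trained])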
Optimize Cross-Validation Time Three Times Faster Using ...
Multiprocessing didn't impair the training speed. The difference between the two results for each combination doesn't differ too much.
Read more >
The training always freezes after some epochs. #22671 - GitHub
The training always freezes after some epochs. GPU usage is constantly 100%, the data loader also stops working. No error information.
Read more >
Varying errors during training - fastai - fast.ai Course Forums
I am trying to train a model using a vision_learner based on some thousand images. I have tried with several different networks, ...
Read more >
Multi-worker training with Keras | TensorFlow Core
This tutorial demonstrates how to perform multi-worker distributed training with a Keras model and the Model.fit API using the tf.distribute.
Read more >
