Multiprocessing Errors when Training for Multiple Epochs
See original GitHub issue
Bug Description
When training RoBERTa on a text regression task for more than 1 epoch, I get an OSError and a ValueError from the multiprocessing package (see details below).
Reproduction Info
The errors persist even when using the exact code presented in the simpletransformers documentation, except for setting model_args.num_train_epochs = 4 (or any other value > 1). I also tried fixing model_args.process_count and, afterwards, setting model_args.use_multiprocessing = False, both without success. The code runs with no errors when model_args.num_train_epochs = 1.
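For reference, a minimal configuration sketch of the setup described above. The field names follow the simpletransformers ClassificationArgs API as I understand it for 0.48.x; the specific base model and the combination of attempted workarounds shown in the comments are illustrative, not confirmed from the original report's full script:

```python
from simpletransformers.classification import ClassificationModel, ClassificationArgs

model_args = ClassificationArgs()
model_args.regression = True            # text regression task
model_args.num_train_epochs = 4         # any value > 1 triggers the error
model_args.use_multiprocessing = False  # attempted workaround (did not help)
model_args.process_count = 1            # attempted workaround (did not help)

# num_labels=1 is the usual setting for regression with simpletransformers
model = ClassificationModel("roberta", "roberta-base",
                            num_labels=1, args=model_args)
```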
Setup Details
- Debian Linux 10
- GTX 1080 Ti / RTX 2080 Ti with CUDA 11
- Python 3.7, PyTorch 1.7.0, simpletransformers 0.48.14.
Full Error Message
Exception in thread QueueFeederThread:
Traceback (most recent call last):
File ".../anaconda3/lib/python3.7/multiprocessing/queues.py", line 232, in _feed
close()
File ".../anaconda3/lib/python3.7/multiprocessing/connection.py", line 177, in close
self._close()
File ".../anaconda3/lib/python3.7/multiprocessing/connection.py", line 361, in _close
_close(self._handle)
OSError: [Errno 9] Bad file descriptor
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File ".../anaconda3/lib/python3.7/threading.py", line 926, in _bootstrap_inner
self.run()
File ".../anaconda3/lib/python3.7/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File ".../anaconda3/lib/python3.7/multiprocessing/queues.py", line 263, in _feed
queue_sem.release()
ValueError: semaphore or lock released too many times
Edit: Added GPU, PyTorch, and CUDA specs.
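The QueueFeederThread traceback above is characteristic of a multiprocessing.Queue whose underlying pipe is closed while the background feeder thread is still flushing queued items; the feeder then hits the closed file descriptor (OSError: Bad file descriptor) and over-releases its semaphore during cleanup. This is not specific to simpletransformers. A minimal stdlib sketch of the shutdown ordering that avoids the race (drain the queue and join the worker before closing) might look like:

```python
import multiprocessing as mp

def worker(q):
    # put() hands the item to a background feeder thread, which
    # writes it into the queue's pipe asynchronously.
    q.put("done")

if __name__ == "__main__":
    q = mp.Queue()
    p = mp.Process(target=worker, args=(q,))
    p.start()
    result = q.get()   # drain the queue before tearing anything down
    p.join()
    q.close()          # close only once no feeder thread still needs the pipe
    q.join_thread()    # wait for this process's feeder thread to finish
    print(result)
```

Closing or garbage-collecting the queue in the reverse order (before the producers are drained and joined) is one way to provoke exactly the pair of exceptions shown in the traceback.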
Issue Analytics
- Created: 3 years ago
- Comments: 15 (4 by maintainers)
Top Results From Across the Web
python - How can take advantage of multiprocessing and ...
Here's how you can use multiprocessing to train multiple models at the same time (using processes running in parallel on each separate CPU...
Optimize Cross-Validation Time Three Times Faster Using ...
Multiprocessing didn't impair the training speed. The difference between the two results for each combination doesn't differ too much.
The training always freezes after some epochs. #22671 - GitHub
The training always freezes after some epochs. GPU usage is constantly 100%, the data loader also stops working. No error information.
Varying errors during training - fastai - fast.ai Course Forums
I am trying to train a model using a vision_learner based on some thousand images. I have tried with several different networks, ...
Multi-worker training with Keras | TensorFlow Core
This tutorial demonstrates how to perform multi-worker distributed training with a Keras model and the Model.fit API using the tf.distribute.
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.