Broken pipe when training a model on CPU
Hi,
I followed the instructions in README.md to train an A2C agent in the DoorKey environment with the following command, using Python 3.7.3 on Ubuntu 18.04 with 8 CPUs:
python scripts/train.py --algo a2c --env MiniGrid-DoorKey-5x5-v0 --model DoorKey --save-interval 10 --frames 80000
Training went well initially but ended with a BrokenPipeError exception that crashed the training process. The error message is copied below. According to scripts/train.py, the above command runs with 16 processes. At first I thought the error occurred because the training started too many processes, but the same exception occurred with --procs=6. Training only completed successfully with --procs=1. Is there any special setting needed to enable training with multiple processes?
(I just realized that the error originates in torch_ac.)
Error Message
Exception ignored in: <function ParallelEnv.__del__ at 0x7f2df3411a60>
Traceback (most recent call last):
File "~/torch-ac/torch_ac/utils/penv.py", line 41, in __del__
File "~/anaconda3/lib/python3.7/multiprocessing/connection.py", line 206, in send
File "~/anaconda3/lib/python3.7/multiprocessing/connection.py", line 404, in _send_bytes
File "~/anaconda3/lib/python3.7/multiprocessing/connection.py", line 368, in _send
BrokenPipeError: [Errno 32] Broken pipe
Top GitHub Comments
Hi, @lcswillems,
I ran the code on Ubuntu 18.04 with Python 3.7.3 and no GPU. I cannot tell yet where in the training the error is triggered. I will look into it.
I am closing this issue because I think I have fixed it. @oceank, if I haven't, please tell me and I will reopen it.
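For readers who hit the same error: the traceback shows the exception being raised inside ParallelEnv.__del__ while sending on a multiprocessing connection, which is what happens when the other end of the pipe has already been closed. The sketch below is not the torch_ac code and not necessarily the fix that was applied; it only illustrates, under assumptions, how a destructor can tolerate workers that are already gone. The class name SafeWorkerPool, the "close" message, and the locals attribute are hypothetical.

```python
import multiprocessing as mp


class SafeWorkerPool:
    """Hypothetical stand-in for a ParallelEnv-style wrapper (not torch_ac code)."""

    def __init__(self, connections):
        # Parent-side ends of the pipes to the worker processes.
        self.locals = list(connections)
        self.closed = False

    def close(self):
        """Ask every worker to stop, ignoring workers that have already exited."""
        if self.closed:
            return
        self.closed = True
        for conn in self.locals:
            try:
                # Hypothetical shutdown message; a real worker loop defines its own.
                conn.send(("close", None))
            except OSError:
                # BrokenPipeError is a subclass of OSError: the other end of
                # the pipe is already closed, so there is nothing left to notify.
                pass
            finally:
                conn.close()

    def __del__(self):
        # Cleanup in a destructor must not raise; close() swallows pipe errors.
        self.close()


# Simulate a worker that has already exited by closing its end of the pipe.
local, remote = mp.Pipe()
remote.close()

pool = SafeWorkerPool([local])
pool.close()  # does not raise, even though the remote end is gone
```

Calling close() explicitly at the end of training, rather than relying on __del__, also avoids depending on the order in which objects are destroyed at interpreter shutdown.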