A training loop got stuck in a certain condition with multi-processing updater and opencv.
The issue does not appear when I use Pillow or the serial updater.
- Conditions
- Chainer version: 2.0
- CuPy version: 1.0.0.1
- OS/Platform Ubuntu 14.04.5 (for PFN ppl sakura server 1)
- CUDA/cuDNN version: CUDA V8.0.44; I don’t know how to check the cuDNN version…
- Code to reproduce
https://github.com/apple2373/chainer-train-stuck
Note that this requires other libraries such as ChainerCV, OpenCV, etc.
Then run:
python train.py --gpu 0 --mode 0
- Error messages, stack traces, or logs: No message appears while it is stuck, but I get the following output when I abort it.
stsutsui@sakura1:/mnt/sakura201/stsutsui/chainer-train-stuck$ python train.py --gpu 0 --mode 0
epoch iteration elapsed_time main/loss main/accuracy
0 1 7.3559 1.07867 0.498882
^CProcess Process-8:…] 1.00%
Process Process-7:###########…] 35.29%
Traceback (most recent call last):rations
Traceback (most recent call last):me to finish: 0:00:00.
File "/mnt/sakura201/stsutsui/anadonda2/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
File "/mnt/sakura201/stsutsui/anadonda2/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/mnt/sakura201/stsutsui/anadonda2/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/mnt/sakura201/stsutsui/anadonda2/lib/python2.7/site-packages/chainer/iterators/multiprocess_iterator.py", line 386, in _worker
self.run()
File "/mnt/sakura201/stsutsui/anadonda2/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/mnt/sakura201/stsutsui/anadonda2/lib/python2.7/site-packages/chainer/iterators/multiprocess_iterator.py", line 386, in _worker
cnt, mem_index, index = in_queue.get()
cnt, mem_index, index = in_queue.get()
File "/mnt/sakura201/stsutsui/anadonda2/lib/python2.7/multiprocessing/queues.py", line 115, in get
File "/mnt/sakura201/stsutsui/anadonda2/lib/python2.7/multiprocessing/queues.py", line 117, in get
res = self._recv()
self._rlock.acquire()
KeyboardInterrupt
KeyboardInterrupt
Top GitHub Comments
@apple2373 I reproduced it with this environment setting: https://gist.github.com/mitmul/2cd98788c07ebbae1815232a32f95728 You can trace what I saw in the same environment built with the Dockerfile.
I also found a workaround: putting OMP_NUM_THREADS=1 before the python command solves the problem, and the training progresses without getting stuck. This is actually related to OpenCV’s imread method, because if I replace the get_example method of SegDataset with an alternative one that just returns a NumPy array created inside the method, training proceeds without getting stuck.
Another workaround is to call cv.setNumThreads(0) right after the import cv2 as cv line in the source code.
This issue is closed as announced. Feel free to re-open it if needed.
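For reference, the two workarounds mentioned in this thread (OMP_NUM_THREADS=1 and cv.setNumThreads(0)) could also be applied at the top of the training script. This is only a minimal sketch, not the code from the repository: setting the variable from Python instead of the shell is an assumption that relies on it being set before OpenCV is imported, and thread_settings is a hypothetical helper added just to inspect the result.

```python
import os

# Assumption: set this before importing OpenCV (or other OpenMP-backed
# libraries), since such libraries usually read it when they are loaded.
# The thread itself only describes the shell form: OMP_NUM_THREADS=1 python ...
os.environ["OMP_NUM_THREADS"] = "1"

try:
    import cv2 as cv
except ImportError:
    # Lets the sketch run on machines where OpenCV is not installed.
    cv = None

if cv is not None:
    # Second workaround from the thread: disable OpenCV's internal threading.
    cv.setNumThreads(0)


def thread_settings():
    """Hypothetical helper: report the thread-related settings applied above."""
    return {
        "OMP_NUM_THREADS": os.environ.get("OMP_NUM_THREADS"),
        "opencv_threads": cv.getNumThreads() if cv is not None else None,
    }


print(thread_settings())
```

Either setting alone was reported to be enough in the comment above; applying both here is just for illustration.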