RuntimeError during the training example with OIM
Hi all,
After I executed the command
python examples/resnet.py -d viper -b 64 -j 2 --loss oim --logs-dir logs/resnet-viper-oim
I encountered the following errors:
Process Process-4:
Traceback (most recent call last):
  File "/root/miniconda2/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/root/miniconda2/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/root/miniconda2/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 45, in _worker_loop
    data_queue.put((idx, samples))
  File "/root/miniconda2/lib/python2.7/multiprocessing/queues.py", line 392, in put
    return send(obj)
  File "/root/miniconda2/lib/python2.7/site-packages/torch/multiprocessing/queue.py", line 17, in send
    ForkingPickler(buf, pickle.HIGHEST_PROTOCOL).dump(obj)
  File "/root/miniconda2/lib/python2.7/pickle.py", line 224, in dump
    self.save(obj)
  File "/root/miniconda2/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/root/miniconda2/lib/python2.7/pickle.py", line 554, in save_tuple
    save(element)
  File "/root/miniconda2/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/root/miniconda2/lib/python2.7/pickle.py", line 606, in save_list
    self._batch_appends(iter(obj))
  File "/root/miniconda2/lib/python2.7/pickle.py", line 639, in _batch_appends
    save(x)
  File "/root/miniconda2/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/root/miniconda2/lib/python2.7/multiprocessing/forking.py", line 67, in dispatcher
    self.save_reduce(obj=obj, *rv)
  File "/root/miniconda2/lib/python2.7/pickle.py", line 401, in save_reduce
    save(args)
  File "/root/miniconda2/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/root/miniconda2/lib/python2.7/pickle.py", line 554, in save_tuple
    save(element)
  File "/root/miniconda2/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/root/miniconda2/lib/python2.7/multiprocessing/forking.py", line 66, in dispatcher
    rv = reduce(obj)
  File "/root/miniconda2/lib/python2.7/site-packages/torch/multiprocessing/reductions.py", line 113, in reduce_storage
    fd, size = storage.share_fd()
RuntimeError: unable to write to file </torch_29225_1654046705> at /py/conda-bld/pytorch_1493669264383/work/torch/lib/TH/THAllocator.c:267
When I switch to the xentropy loss with
python examples/resnet.py -d viper -b 64 -j 1 --loss xentropy --logs-dir logs/resnet-viper-xentropy
the following error occurs:
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/root/miniconda2/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/root/miniconda2/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/root/miniconda2/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 51, in _pin_memory_loop
    r = in_queue.get()
  File "/root/miniconda2/lib/python2.7/multiprocessing/queues.py", line 378, in get
    return recv()
  File "/root/miniconda2/lib/python2.7/site-packages/torch/multiprocessing/queue.py", line 22, in recv
    return pickle.loads(buf)
  File "/root/miniconda2/lib/python2.7/pickle.py", line 1388, in loads
    return Unpickler(file).load()
  File "/root/miniconda2/lib/python2.7/pickle.py", line 864, in load
    dispatch[key](self)
  File "/root/miniconda2/lib/python2.7/pickle.py", line 1139, in load_reduce
    value = func(*args)
  File "/root/miniconda2/lib/python2.7/site-packages/torch/multiprocessing/reductions.py", line 68, in rebuild_storage_fd
    fd = multiprocessing.reduction.rebuild_handle(df)
  File "/root/miniconda2/lib/python2.7/multiprocessing/reduction.py", line 155, in rebuild_handle
    conn = Client(address, authkey=current_process().authkey)
  File "/root/miniconda2/lib/python2.7/multiprocessing/connection.py", line 169, in Client
    c = SocketClient(address)
  File "/root/miniconda2/lib/python2.7/multiprocessing/connection.py", line 308, in SocketClient
    s.connect(address)
  File "/root/miniconda2/lib/python2.7/socket.py", line 228, in meth
    return getattr(self._sock,name)(*args)
error: [Errno 111] Connection refused
In both cases, the terminal freezes after these errors are printed, and I have to kill the corresponding Python process to exit. Any suggestions on how to solve this?
Issue Analytics
- Created 6 years ago
- Comments:21 (8 by maintainers)
Top GitHub Comments
@GBJim, @Cysu I guess that PyTorch's DataParallel doesn't work well with nvidia-docker. Or maybe it is caused by PyTorch itself (see the PyTorch forum).
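One way to narrow down the Docker guess: the first traceback fails in `storage.share_fd()`, which is where a DataLoader worker places a tensor in shared memory, and containers started with Docker's defaults usually get only a small `/dev/shm`. The sketch below is a diagnostic, not a confirmed fix for this issue. It only reports how much space is free on `/dev/shm`; the helper names and the 1 GiB threshold are assumptions made for illustration.

```python
from __future__ import print_function  # keeps the script usable on Python 2.7

import os


def shm_free_bytes(path="/dev/shm"):
    """Return the bytes free for unprivileged users on `path`, or None if it is missing."""
    if not os.path.isdir(path):
        return None
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize


def describe_shm(min_bytes=1 << 30, path="/dev/shm"):
    """Build a one-line report; `min_bytes` is an arbitrary threshold, not a PyTorch requirement."""
    free = shm_free_bytes(path)
    if free is None:
        return "{} not found".format(path)
    status = "looks small" if free < min_bytes else "looks ok"
    return "{}: {:.0f} MiB free ({})".format(path, free / float(1 << 20), status)


if __name__ == "__main__":
    print(describe_shm())
```

If the reported size is small when run inside the container, restarting it with a larger `--shm-size` or with `--ipc=host` (both are `docker run` options) is worth trying before digging further into DataParallel.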
@lzj322 Yeah, two programs cannot run on the same device if using NCCL.