RuntimeError during the training example with OIM

Hi all

After I executed the command

python examples/resnet.py -d viper -b 64 -j 2 --loss oim --logs-dir logs/resnet-viper-oim

I encountered the following errors:

Process Process-4:
Traceback (most recent call last):
  File "/root/miniconda2/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/root/miniconda2/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/root/miniconda2/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 45, in _worker_loop
    data_queue.put((idx, samples))
  File "/root/miniconda2/lib/python2.7/multiprocessing/queues.py", line 392, in put
    return send(obj)
  File "/root/miniconda2/lib/python2.7/site-packages/torch/multiprocessing/queue.py", line 17, in send
    ForkingPickler(buf, pickle.HIGHEST_PROTOCOL).dump(obj)
  File "/root/miniconda2/lib/python2.7/pickle.py", line 224, in dump
    self.save(obj)
  File "/root/miniconda2/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/root/miniconda2/lib/python2.7/pickle.py", line 554, in save_tuple
    save(element)
  File "/root/miniconda2/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/root/miniconda2/lib/python2.7/pickle.py", line 606, in save_list
    self._batch_appends(iter(obj))
  File "/root/miniconda2/lib/python2.7/pickle.py", line 639, in _batch_appends
    save(x)
  File "/root/miniconda2/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/root/miniconda2/lib/python2.7/multiprocessing/forking.py", line 67, in dispatcher
    self.save_reduce(obj=obj, *rv)
  File "/root/miniconda2/lib/python2.7/pickle.py", line 401, in save_reduce
    save(args)
  File "/root/miniconda2/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/root/miniconda2/lib/python2.7/pickle.py", line 554, in save_tuple
    save(element)
  File "/root/miniconda2/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/root/miniconda2/lib/python2.7/multiprocessing/forking.py", line 66, in dispatcher
    rv = reduce(obj)
  File "/root/miniconda2/lib/python2.7/site-packages/torch/multiprocessing/reductions.py", line 113, in reduce_storage
    fd, size = storage.share_fd()
RuntimeError: unable to write to file </torch_29225_1654046705> at /py/conda-bld/pytorch_1493669264383/work/torch/lib/TH/THAllocator.c:267

When I switched to the xentropy loss with

python examples/resnet.py -d viper -b 64 -j 1 --loss xentropy --logs-dir logs/resnet-viper-xentropy

the following error occurred:

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/root/miniconda2/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/root/miniconda2/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/root/miniconda2/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 51, in _pin_memory_loop
    r = in_queue.get()
  File "/root/miniconda2/lib/python2.7/multiprocessing/queues.py", line 378, in get
    return recv()
  File "/root/miniconda2/lib/python2.7/site-packages/torch/multiprocessing/queue.py", line 22, in recv
    return pickle.loads(buf)
  File "/root/miniconda2/lib/python2.7/pickle.py", line 1388, in loads
    return Unpickler(file).load()
  File "/root/miniconda2/lib/python2.7/pickle.py", line 864, in load
    dispatch[key](self)
  File "/root/miniconda2/lib/python2.7/pickle.py", line 1139, in load_reduce
    value = func(*args)
  File "/root/miniconda2/lib/python2.7/site-packages/torch/multiprocessing/reductions.py", line 68, in rebuild_storage_fd
    fd = multiprocessing.reduction.rebuild_handle(df)
  File "/root/miniconda2/lib/python2.7/multiprocessing/reduction.py", line 155, in rebuild_handle
    conn = Client(address, authkey=current_process().authkey)
  File "/root/miniconda2/lib/python2.7/multiprocessing/connection.py", line 169, in Client
    c = SocketClient(address)
  File "/root/miniconda2/lib/python2.7/multiprocessing/connection.py", line 308, in SocketClient
    s.connect(address)
  File "/root/miniconda2/lib/python2.7/socket.py", line 228, in meth
    return getattr(self._sock,name)(*args)
error: [Errno 111] Connection refused

In both situations, the terminal freezes after these errors are printed, and I have to kill the corresponding Python process in order to exit. Any suggestions on how to solve this?
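
The first traceback fails in storage.share_fd(), the point where a DataLoader worker places a tensor into shared memory to hand it to the main process. One possible cause, which the thread does not confirm, is a small shared-memory mount inside a container. As a minimal diagnostic sketch, assuming a Linux system where shared memory is mounted at /dev/shm, the helper below (the names shm_usage and format_mib are hypothetical, not part of PyTorch) reports how much space is available there:

```python
import os


def shm_usage(path="/dev/shm"):
    """Return total and free bytes of the filesystem mounted at `path`.

    Assumption: `path` is where shared memory is mounted (true on most
    Linux systems, but not guaranteed everywhere).
    """
    stats = os.statvfs(path)
    return {
        "total_bytes": stats.f_frsize * stats.f_blocks,
        "free_bytes": stats.f_frsize * stats.f_bavail,
    }


def format_mib(num_bytes):
    """Format a byte count as mebibytes with one decimal place."""
    return "%.1f MiB" % (num_bytes / (1024.0 * 1024.0))


if __name__ == "__main__":
    if os.path.isdir("/dev/shm"):
        usage = shm_usage()
        print("total: " + format_mib(usage["total_bytes"]))
        print("free:  " + format_mib(usage["free_bytes"]))
    else:
        print("/dev/shm not found on this system")
```

If the reported total is only a few tens of MiB, the container was probably started with Docker's default shared-memory size. Docker's --shm-size and --ipc=host options change that, though whether this resolves the errors above is an assumption to verify on the affected machine.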

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 21 (8 by maintainers)

Top GitHub Comments

1 reaction
lzj322 commented, Jul 17, 2017

@GBJim, @Cysu I guess that PyTorch's DataParallel doesn't work well with nvidia-docker. Or maybe it is caused by PyTorch itself (see the PyTorch forum).

0 reactions
Cysu commented, Jul 18, 2017

@lzj322 Yeah, two programs cannot run on the same device if using NCCL.
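
If two training runs need to coexist on a machine with more than one GPU, one way to keep them off the same device is to restrict each process to its own GPU through the CUDA_VISIBLE_DEVICES environment variable. The sketch below is illustrative only: env_for_gpu is a hypothetical helper, and the GPU index and the availability of a second GPU are assumptions.

```python
import os


def env_for_gpu(gpu_index, base_env=None):
    """Return a copy of the environment that exposes only one GPU to CUDA.

    The base environment is copied, never modified in place.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env
```

The returned dictionary can be passed as the env argument of subprocess.Popen when launching a training command, for example subprocess.Popen(cmd, env=env_for_gpu(1)), so that the launched process sees only GPU 1.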

