Calling crypten.nn.from_pytorch hangs

Bug

We want a parent process to call crypten.nn.from_pytorch and then have child processes call it as well. However, the child processes seem to block at the first call to torch.onnx.export inside crypten.nn.from_pytorch.

Reproduce

This example script reproduces the bug:

import torch
import crypten
import multiprocessing


def _get_model():
    class ExampleNet(torch.nn.Module):
        def __init__(self):
            super(ExampleNet, self).__init__()
            self.conv1 = torch.nn.Conv2d(1, 16, kernel_size=5, padding=0)
            self.fc1 = torch.nn.Linear(16 * 12 * 12, 100)
            self.fc2 = torch.nn.Linear(
                100, 2
            )  # For binary classification, final layer needs only 2 outputs

        def forward(self, x):
            out = self.conv1(x)
            out = torch.nn.functional.relu(out)
            out = torch.nn.functional.max_pool2d(out, 2)
            out = out.view(out.size(0), -1)
            out = self.fc1(out)
            out = torch.nn.functional.relu(out)
            out = self.fc2(out)
            return out

    dummy_input = torch.empty(1, 1, 28, 28)
    example_net = ExampleNet()
    model = crypten.nn.from_pytorch(example_net, dummy_input)
    return model


def proc():
    print("\tGetting model inside proc")
    # it blocks here only when we have called crypten.nn.from_pytorch in the parent process
    model = _get_model()
    print("\tGot model inside proc")
    return model


print("[+] Start")
# it doesn't block if we call this multiple times inside the same process
model = _get_model()
print("[+] Got model")
process = multiprocessing.Process(target=proc, args=())
print("[+] Starting process")
process.start()
print("[+] Waiting process")
process.join()
print("[+] End")

Environment

$ python collect_env.py
Collecting environment information...
PyTorch version: 1.4.0
Is debug build: No
CUDA used to build PyTorch: 10.1

OS: Manjaro Linux
GCC version: (GCC) 9.2.0
CMake version: version 3.16.2

Python version: 3.7
Is CUDA available: No
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA

Versions of relevant libraries:
[pip3] numpy==1.18.0
[pip3] torch==1.3.1
[conda] torch                     1.4.0                    pypi_0    pypi
[conda] torchvision               0.5.0                    pypi_0    pypi
$ pip freeze
absl-py==0.9.0
appdirs==1.4.3
astor==0.8.1
attrs==19.3.0
backcall==0.1.0
black==19.10b0
bleach==3.1.0
certifi==2019.11.28
cffi==1.13.2
chardet==3.0.4
Click==7.0
coverage==4.5
-e git+https://github.com/facebookresearch/CrypTen.git@68e0364c66df95ddbb98422fb641382c3f58734c#egg=crypten
cryptography==2.8
decorator==4.4.1
defusedxml==0.6.0
dill==0.3.1.1
entrypoints==0.3
Flask==1.1.1
Flask-SocketIO==4.2.1
future==0.18.2
gast==0.2.2
google-pasta==0.1.8
grpcio==1.26.0
h5py==2.10.0
idna==2.8
importlib-metadata==1.3.0
ipykernel==5.1.3
ipython==7.10.2
ipython-genutils==0.2.0
ipywidgets==7.5.1
itsdangerous==1.1.0
jedi==0.15.1
Jinja2==2.10.3
joblib==0.14.1
jsonschema==3.2.0
jupyter==1.0.0
jupyter-client==5.3.4
jupyter-console==6.0.0
jupyter-core==4.6.1
Keras-Applications==1.0.8
Keras-Preprocessing==1.1.0
lz4==3.0.2
Markdown==3.1.1
MarkupSafe==1.1.1
mistune==0.8.4
more-itertools==8.0.2
msgpack==1.0.0
nbconvert==5.6.1
nbformat==4.4.0
notebook==6.0.2
numpy==1.18.1
onnx==1.6.0
opt-einsum==3.1.0
packaging==20.1
pandocfilters==1.4.2
parso==0.5.2
pathspec==0.7.0
pexpect==4.7.0
phe==1.4.0
pickleshare==0.7.5
Pillow==6.2.2
pluggy==0.13.1
prometheus-client==0.7.1
prompt-toolkit==2.0.9
protobuf==3.11.2
ptyprocess==0.6.0
pudb==2019.2
py==1.8.1
pycparser==2.19
Pygments==2.5.2
pyOpenSSL==19.1.0
pyparsing==2.4.6
pyrsistent==0.15.6
pytest==5.3.4
pytest-cov==2.8.1
python-dateutil==2.8.1
python-engineio==3.11.1
python-socketio==4.4.0
PyYAML==5.2
pyzmq==18.1.0
qtconsole==4.6.0
regex==2020.2.20
requests==2.22.0
RestrictedPython==5.0
scikit-learn==0.22
scipy==1.4.1
Send2Trash==1.5.0
six==1.13.0
sklearn==0.0
-e git+git@github.com:youben11/PySyft.git@1faf4400ffcce224fde347333cbfe15a5ab12660#egg=syft
syft-proto==0.2.1a1.post2
tblib==1.6.0
tensorboard==1.15.0
tensorflow==1.15.0
tensorflow-estimator==1.15.1
termcolor==1.1.0
terminado==0.8.3
testpath==0.4.4
tf-encrypted==0.5.9
toml==0.10.0
torch==1.4.0
torchvision==0.5.0
tornado==4.5.3
traitlets==4.3.3
typed-ast==1.4.1
typing-extensions==3.7.4.1
urllib3==1.25.7
urwid==2.1.0
wcwidth==0.1.7
webencodings==0.5.1
websocket-client==0.57.0
websockets==8.1
Werkzeug==0.16.0
widgetsnbextension==3.5.1
wrapt==1.11.2
zipp==0.6.0

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 2
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

2 reactions
marksibrahim commented, Apr 29, 2020

@youben11 Adding torch.set_num_threads(1), as suggested in https://github.com/pytorch/pytorch/issues/36191#issuecomment-620956849, does the trick. With that additional setting, I successfully ran the script on Linux:

[+] Start
[+] Got model
[+] Starting process
[+] Waiting process
	Getting model inside proc
	Got model inside proc
[+] End

Feel free to reopen if you have trouble.
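
The comment above does not show where the call was placed. The following is a minimal sketch, assuming torch.set_num_threads(1) runs once in the parent before the first model conversion. The names limit_torch_threads and tiny_forward are hypothetical; tiny_forward is only a stand-in for _get_model() so that the sketch runs without CrypTen installed.

```python
import multiprocessing

import torch


def limit_torch_threads():
    """Restrict PyTorch to a single intra-op thread (the suggested workaround)."""
    torch.set_num_threads(1)
    return torch.get_num_threads()


def tiny_forward():
    # Hypothetical stand-in for _get_model(). In the reproduction script this
    # is where crypten.nn.from_pytorch(example_net, dummy_input) would run.
    layer = torch.nn.Linear(4, 2)
    return layer(torch.zeros(1, 4)).shape


if __name__ == "__main__":
    # Assumption: the setting is applied before any work happens in the parent.
    limit_torch_threads()
    tiny_forward()  # first call, in the parent process

    child = multiprocessing.Process(target=tiny_forward)
    child.start()
    child.join()  # second call, in the child process
    print("child exit code:", child.exitcode)
```

In the original script, the equivalent change would presumably be a single torch.set_num_threads(1) line placed right after the imports. Note that this limits PyTorch's CPU parallelism for the whole process.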

1 reaction
marksibrahim commented, Apr 8, 2020

@youben11 I was able to reproduce this behavior on my end. Unfortunately, it seems that on some versions of Linux, torch.onnx.export does not support spawning another child process for exporting 😦

I filed a bug with PyTorch: https://github.com/pytorch/pytorch/issues/36191#issue-596231442. Feel free to follow along or add any context I may have missed on that issue.
