
[BUG] SAM Model RuntimeError: Cannot re-initialize CUDA in forked subprocess

See original GitHub issue

When trying to use SAM with a GPU under WSL2 Ubuntu, I got the following exception:

output = model1.predict(data_sk)

Process Process-15:
Traceback (most recent call last):
  File "/usr/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/usr/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.7/dist-packages/cdt/utils/parallel.py", line 54, in worker_subprocess
    output = function(*args, **kwargs, device=device, idx=idx)
  File "/usr/local/lib/python3.7/dist-packages/cdt/causality/graph/SAM.py", line 216, in run_SAM
    data = th.from_numpy(data).to(device)
  File "/usr/local/lib/python3.7/dist-packages/torch/cuda/__init__.py", line 164, in _lazy_init
    "Cannot re-initialize CUDA in forked subprocess. " + msg)
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

It seems to be a bug raised by parallel_run() when njobs > 0.
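The error message itself points at the workaround: a CUDA context cannot survive a fork(), so any worker that touches the GPU has to be started with the "spawn" method. As a minimal sketch with the stdlib multiprocessing module (not cdt's actual parallel_run; the helper name is hypothetical), forcing the start method before any workers are created would look like:

```python
import multiprocessing as mp

def ensure_spawn_start_method():
    """Force the 'spawn' start method so worker processes begin with a
    fresh interpreter instead of inheriting the parent's CUDA state.

    Must be called before any Pool or Process is created.
    """
    # force=True overrides a previously selected method (e.g. the Linux
    # default 'fork'), so it is safe to call more than once.
    mp.set_start_method("spawn", force=True)
    return mp.get_start_method()

if __name__ == "__main__":
    print(ensure_spawn_start_method())
```

Note that "spawn" re-imports the main module in each worker, so the worker target must be defined at module top level and the entry point guarded by `if __name__ == "__main__":`.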

CUDA has successfully recognized my GPU:

torch.cuda.get_device_name(0)
'GeForce GTX 1080 Ti'

cdt.SETTINGS.GPU = 1

On CPU, with SAM(nruns=1, njobs=1, gpus=0), the model runs smoothly.

My spec:

  • cdt 0.5.23
  • CUDA 11.0
  • Python 3.7.5
  • PyTorch 1.7.1
  • Ubuntu 18.04 (WSL2), Windows 10 Build 21292.1010

EDIT:

I reproduced the bug within nvidia-docker (Python 3.6). The Docker container was deployed under WSL2 Ubuntu rather than Docker Desktop for Windows. PyTorch and CUDA recognized my GPU within the container just as in WSL, so it's probably not a package version conflict.

And my input data is a (546, 13) pandas DataFrame.

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 5

Top GitHub Comments

1 reaction
Diviyan-Kalainathan commented, Feb 11, 2021

Hello @kangqiao-ctrl, thank you for your input! I didn't know that torch actually allows specifying the spawn method, and I think that would be the cleanest solution (I tried lots of things in the past, to no avail). I will implement this for the next version!
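For a library, mutating the global start method is intrusive; the stdlib also supports a per-call context that selects the start method locally, and torch.multiprocessing mirrors the same API. A small sketch of that approach (how cdt might adopt it is an assumption here):

```python
import multiprocessing as mp

# A context bound to one start method, without touching the
# process-wide default set by set_start_method(). A library can hand
# this context's Pool/Process to its workers; swapping mp for
# torch.multiprocessing would be the GPU-aware equivalent
# (an assumption about the eventual cdt fix).
ctx = mp.get_context("spawn")
print(ctx.get_start_method())
```

Because the context is local, other code in the same process that relies on the default "fork" method keeps working unchanged.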

1 reaction
Diviyan-Kalainathan commented, Feb 2, 2021

Hello, thanks for the feedback; this is strange. I've got some quick questions:

  • Are you initializing torch.cuda in the main process?
  • Have you tried gpus=1, njobs=1?
  • Can you give me the docker image tag?

I’ll look into it in the meantime!

Read more comments on GitHub >

Top Results From Across the Web

Cannot re-initialize CUDA in forked subprocess - Stack Overflow
I load the model in the parent process and it's accessible to each forked worker process. The problem occurs when creating a CUDA-backed...
Read more >
Cannot re-initialize CUDA in forked subprocess" Displayed in ...
When PyTorch is used to start multiple processes, the following error message is displayed:RuntimeError: Cannot re-initialize CUDA in forked subprocessThe ...
Read more >
RuntimeError: Cannot re-initialize CUDA in forked subprocess ...
I'm getting the above error even though I'm not using multiprocessing. Not using multiprocessing, but getting CUDA error re. forked subprocess.
Read more >
Bug listing with status CONFIRMED as at 2022/12/20 18:46:38
There is no server at all: it only relies on end-users' machines." status:CONFIRMED resolution: severity:normal · Bug:97934 - "kevedit-0.5.1.ebuild (New ...
Read more >
1.1.7 PDF - PyTorch Lightning Documentation
The problem is that PyTorch has issues with num_workers >. 0 when using .spawn(). ... CUDA error: an illegal memory access was encountered....
Read more >
