[BUG] SAM Model RuntimeError: Cannot re-initialize CUDA in forked subprocess
When trying to use SAM with a GPU under WSL2 Ubuntu, I got the following exception:
```
output = model1.predict(data_sk)

Process Process-15:
Traceback (most recent call last):
  File "/usr/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/usr/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.7/dist-packages/cdt/utils/parallel.py", line 54, in worker_subprocess
    output = function(*args, **kwargs, device=device, idx=idx)
  File "/usr/local/lib/python3.7/dist-packages/cdt/causality/graph/SAM.py", line 216, in run_SAM
    data = th.from_numpy(data).to(device)
  File "/usr/local/lib/python3.7/dist-packages/torch/cuda/__init__.py", line 164, in _lazy_init
    "Cannot re-initialize CUDA in forked subprocess. " + msg)
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
```
It seems to be a bug triggered by parallel_run() when njobs > 0.
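For context, here is a minimal sketch (not from the original report; the worker body is purely illustrative) that reproduces this class of failure: once the parent process has initialized CUDA, any forked child that touches CUDA raises the same RuntimeError, while the 'spawn' start method avoids it:

```python
import torch
import torch.multiprocessing as mp

def worker(device):
    # Touching CUDA in a forked child after the parent has already
    # initialized CUDA raises:
    # RuntimeError: Cannot re-initialize CUDA in forked subprocess.
    x = torch.zeros(4).to(device)
    print(x.device)

if __name__ == "__main__":
    torch.cuda.init()             # parent initializes CUDA first
    ctx = mp.get_context("fork")  # "fork" reproduces the bug; "spawn" avoids it
    p = ctx.Process(target=worker, args=("cuda:0",))
    p.start()
    p.join()
```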
CUDA has successfully recognized my GPU: torch.cuda.get_device_name(0) returns 'GeForce GTX 1080 Ti', and I set cdt.SETTINGS.GPU = 1. With the CPU, SAM(nruns=1, njobs=1, gpus=0), the model runs smoothly.
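To make the two configurations concrete, here is a sketch of both invocations (the dummy data only mirrors the (546, 13) shape from the report; the constructor arguments are the ones quoted above):

```python
import cdt
import numpy as np
import pandas as pd
from cdt.causality.graph import SAM

# Dummy stand-in matching the reported (546, 13) DataFrame shape
data_sk = pd.DataFrame(np.random.randn(546, 13))

cdt.SETTINGS.GPU = 1

# Runs smoothly: single run, single job, no GPU
model_cpu = SAM(nruns=1, njobs=1, gpus=0)
output = model_cpu.predict(data_sk)

# Raises the RuntimeError above: the worker subprocess is forked
# after CUDA has already been initialized in the parent
model_gpu = SAM(nruns=1, njobs=1, gpus=1)
output = model_gpu.predict(data_sk)
```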
My Spec:
- cdt 0.5.23
- CUDA 11.0
- Python 3.7.5
- PyTorch 1.7.1
- Ubuntu 18.04 (WSL2), Windows 10 Build 21292.1010
EDIT:
I reproduced the bug within nvidia-docker (Python 3.6). The container was deployed under WSL2 Ubuntu rather than Docker Desktop for Windows. PyTorch and CUDA recognized my GPU inside the container just as in WSL, so it is probably not a package version conflict.
My input data is a (546, 13) Pandas DataFrame.
Hello @kangqiao-ctrl, thank you for your input! I didn't know that torch actually allows specifying the spawn method, and I think that would be the cleanest solution (I tried lots of things in the past, to no avail). I will implement this for the next version!
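For reference, a minimal sketch of what such a fix could look like; this is an assumption about the approach, not the actual cdt parallel_run() implementation. torch.multiprocessing mirrors the standard multiprocessing API, so workers can be dispatched from a 'spawn' context:

```python
import torch
import torch.multiprocessing as mp

def worker(rank, device, queue):
    # Each spawned child initializes its own CUDA context, so the
    # "Cannot re-initialize CUDA in forked subprocess" error is avoided.
    x = torch.randn(8, 8).to(device)
    queue.put((rank, float(x.sum())))

if __name__ == "__main__":
    ctx = mp.get_context("spawn")  # use spawn instead of the default fork
    queue = ctx.Queue()
    procs = [ctx.Process(target=worker, args=(i, "cuda:0", queue))
             for i in range(2)]
    for p in procs:
        p.start()
    results = [queue.get() for _ in procs]  # drain before join to avoid deadlock
    for p in procs:
        p.join()
    print(results)
```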
Hello, thanks for the feedback; this is strange. I've got some quick questions:
I'll look into it in the meantime!