
Cannot re-initialize CUDA in forked subprocess when loading model in seldon-core

See original GitHub issue

I am using a seldon-core microservice to serve a Faster R-CNN detection model. However, when moving the model to the desired CUDA device with torch's model.to(device) (inside init_detector), the following error is thrown:

Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method.

The same error is thrown even for action recognition models from mmaction2 and pose models from mmpose. I have traced it to the checkpoint loading of the mmcv.runner, but I am not sure how to proceed.

The specific suggestions in this known torch issue do not work in this context.
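For reference, the fork/spawn distinction that the error message points at can be sketched with the standard library alone. This is not the actual seldon wrapper; the init_detector call appears only as a comment to mark where CUDA would first be touched:

```python
import multiprocessing as mp

def worker(queue):
    # In a real server this is where the model would first touch CUDA,
    # e.g. model = init_detector(config, checkpoint, device="cuda:0").
    # Because this child was spawned, no CUDA context is inherited.
    queue.put("initialized")

if __name__ == "__main__":
    # "spawn" starts a fresh interpreter instead of fork()ing the parent,
    # which is exactly what the RuntimeError asks for.
    ctx = mp.get_context("spawn")
    queue = ctx.Queue()
    proc = ctx.Process(target=worker, args=(queue,))
    proc.start()
    print(queue.get())
    proc.join()
```

In torch itself the equivalent call is torch.multiprocessing.set_start_method("spawn"). The catch in this setup is that the workers are forked by the serving layer (gunicorn), not by the wrapper's own code, so the start method cannot simply be changed from inside the model class — which is presumably why the suggestions from the linked torch issue do not help here.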

I am attaching a minimal reproducible example in the below zip file. cuda_error.zip

The contents are:

  • Detection.py - minimal seldon python wrapper
  • Dockerfile - the dockerfile for the environment
  • download_model.sh - script to download the detection model
  • faster_rcnn_r50_caffe_fpn_mstrain_1x_coco-person.py - model config file

Steps:

  1. Create the image: nvidia-docker build -f Dockerfile . -t seldontest
  2. Run the container: nvidia-docker run -p 5000:5000 -p 9000:9000 --name seldontest -it seldontest

I am not sure what the problem is, because I have been able to deploy other models with gunicorn (which is what the seldon microservice uses) on a CUDA device. But, as already mentioned, it seems to be related to the checkpoint loading of mmcv.
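One common fork-safety pattern for servers like this — a sketch only, with a placeholder standing in for the real mmdet init_detector call, and not necessarily what the seldon devs recommended — is to keep all CUDA work out of __init__ and defer it to the first request handled inside the forked worker:

```python
class Detection:
    """Sketch of a fork-safe seldon-style model wrapper (hypothetical)."""

    def __init__(self):
        # Do NOT touch CUDA here: under a forking server such as
        # gunicorn, __init__ may run in the parent before workers fork,
        # and an inherited CUDA context breaks in the children.
        self.model = None

    def _lazy_load(self):
        # Runs inside the worker on the first request, after the fork,
        # so the CUDA context belongs to the process that uses it, e.g.
        #   self.model = init_detector(cfg, ckpt, device="cuda:0")
        self.model = "loaded"  # placeholder for the real detector

    def predict(self, X, features_names=None):
        if self.model is None:
            self._lazy_load()
        return X  # placeholder: the real wrapper would run inference here
```

The design point is simply that the process which forks must never be the process that initializes CUDA; lazy loading guarantees initialization happens post-fork.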

Issue Analytics

  • State: closed
  • Created 10 months ago
  • Comments:7

Top GitHub Comments

1 reaction
rlleshi commented, Dec 15, 2022

@HAOCHENYE thanks a lot for your support. The seldon-core devs confirmed that it was an issue with how the microservice was handling multiprocessing. It is still odd that it failed for OpenMMLab models while working for other (torch) models on CUDA.

For reference to anyone: migrating to V2 solved the issue.

1 reaction
HAOCHENYE commented, Dec 7, 2022

Alright, I’ll check the minimal example today, and give feedback ASAP!

Read more comments on GitHub >

Top Results From Across the Web

Cannot re-initialize CUDA in forked subprocess when loading ...
According to the docs we should use the load() method to load and initialize the model. However, when passing the model to the...
Cannot re-initialize CUDA in forked subprocess - Stack Overflow
I load the model in the parent process and it's accessible to each forked worker process. The problem occurs when creating a CUDA-backed...
Cannot re-initialize CUDA in forked subprocess" Displayed in ...
When PyTorch is used to start multiple processes, the following error message is displayed: RuntimeError: Cannot re-initialize CUDA in forked subprocess. The ...
Cannot re-initialize CUDA in forked subprocess on network.to ...
Hello, I am trying to implement the DistributedDataParallel class in my training code. The training code is a block in a larger block...
