Cannot re-initialize CUDA in forked subprocess when loading model in seldon-core
I am using a seldon-core microservice to serve a Faster R-CNN detection model. However, when moving the model to the desired CUDA device with torch's `model.to(device)` (inside `init_detector`), the following error is thrown:
Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method.
The same error is thrown for human action recognition models from mmaction2 and for pose models from mmpose. I have traced it to the checkpoint loading in `mmcv.runner`, but am not sure how to proceed.
The specific suggestions in this known torch issue do not work in this context.
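For context, the workaround usually suggested in that torch issue is to use the `spawn` start method, so each child process creates its own CUDA context instead of inheriting a forked one. A minimal sketch of that pattern with plain `multiprocessing` (the CUDA calls are only indicated in comments; in this setup the fork happens inside gunicorn/seldon, so this switch is not reachable from user code):

```python
import multiprocessing as mp

def worker(x):
    # In the real scenario this is where CUDA is first touched in the
    # child, e.g. model.to("cuda:0") inside init_detector. With a
    # "spawn" child, the CUDA context is created fresh, avoiding
    # "Cannot re-initialize CUDA in forked subprocess".
    return x * x

def main():
    # "spawn" starts a fresh interpreter instead of fork()ing the
    # parent (which would inherit an already-initialized CUDA context).
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=2) as pool:
        return pool.map(worker, [1, 2, 3])

if __name__ == "__main__":
    print(main())  # [1, 4, 9]
```

This only helps when you control where the processes are started; with a pre-forking server such as gunicorn, the fork has already happened by the time the model-loading code runs.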
I am attaching a minimal reproducible example in the zip file below: cuda_error.zip. The contents are:

- `Detection.py`: minimal seldon python wrapper
- `Dockerfile`: the Dockerfile for the environment
- `download_model.sh`: script to download the detection model
- `faster_rcnn_r50_caffe_fpn_mstrain_1x_coco-person.py`: model config file
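For reference, a minimal seldon python wrapper is just a class whose constructor loads the model and whose `predict` method runs inference. A hypothetical sketch of what `Detection.py` might look like (the checkpoint path and device are placeholders; `init_detector`/`inference_detector` are the `mmdet.apis` entry points; imports are deferred into the methods so the class can be defined without mmdet installed):

```python
class Detection:
    def __init__(self):
        # This runs inside the gunicorn worker process; model.to(device)
        # inside init_detector is where the "Cannot re-initialize CUDA
        # in forked subprocess" error surfaces.
        from mmdet.apis import init_detector

        self.model = init_detector(
            "faster_rcnn_r50_caffe_fpn_mstrain_1x_coco-person.py",
            "model.pth",  # placeholder checkpoint path
            device="cuda:0",
        )

    def predict(self, X, features_names=None):
        # Seldon calls predict with the request payload; run the
        # detector on it and return the raw detections.
        from mmdet.apis import inference_detector

        return inference_detector(self.model, X)
```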
Steps:

1. Build the image: `nvidia-docker build -f Dockerfile . -t seldontest`
2. Run the container: `nvidia-docker run -p 5000:5000 -p 9000:9000 --name seldontest -it seldontest`
I am not sure what the problem is, because I have been able to deploy other models on a CUDA device with gunicorn (which is what the seldon microservice uses). But, as mentioned, it seems to be related to the checkpoint loading in mmcv.
Issue Analytics

- Created: 10 months ago
- Comments: 7
@HAOCHENYE thanks a lot for your support. The seldon-core devs confirmed that it was an issue with how the microservice was handling multiprocessing. It is still odd that it was failing for OpenMMLab models while working for other (torch) models on CUDA.

For reference to anyone: migrating to V2 solved the issue.
Alright, I’ll check the minimal example today, and give feedback ASAP!