Failing to load the pre-trained weights on multiple GPUs - FasterRCNN example
🐛 Describe the bug
Hey, I have an issue when running the FasterRCNN example (https://github.com/pytorch/serve/tree/master/examples/object_detector/fast-rcnn): the pre-trained backbone weights appear to be loaded onto cuda:0 while the model itself is distributed across multiple GPUs. A similar issue was reported before for other architectures: #1037, #1038, and a related issue in pytorch/vision.
I believe this is the same class of problem, but I'm not sure how to handle it in this case. I'd appreciate the help.
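For illustration, here is a minimal sketch of loading the checkpoint explicitly onto the GPU that TorchServe assigns to a worker, assuming the stray cuda:0 allocation comes from weights being deserialized or downloaded without an explicit device mapping. The function name and the "model.pt" path are placeholders, not part of the example:

```python
import torch
import torchvision

def load_model_on_assigned_gpu(gpu_id: int, weights_path: str = "model.pt"):
    # Target the GPU TorchServe assigned to this worker, not cuda:0.
    device = torch.device(f"cuda:{gpu_id}" if torch.cuda.is_available() else "cpu")

    # Build the architecture without downloading pretrained weights, so nothing
    # is fetched or placed on a default device behind our back.
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
        pretrained=False, pretrained_backbone=False
    )

    # map_location forces every tensor in the checkpoint onto this worker's
    # device, regardless of the device it was saved from.
    state_dict = torch.load(weights_path, map_location=device)
    model.load_state_dict(state_dict)
    model.to(device)
    model.eval()
    return model
```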
Error logs
Installation instructions
Install torchserve from source: No
Running in docker: Yes, inside this image: nvcr.io/nvidia/pytorch:21.02-py3
I cloned the serve repo, ran the install_dependencies script, and then pip-installed torchserve.
Model Packaging
I use the built-in handler: https://github.com/pytorch/serve/blob/master/ts/torch_handler/object_detector.py
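A minimal sketch of a custom handler that pins the worker to its assigned GPU before the built-in ObjectDetector handler loads the model; this is a workaround idea based on the related issues, not a confirmed fix, and the class name PinnedObjectDetector is illustrative:

```python
import torch
from ts.torch_handler.object_detector import ObjectDetector

class PinnedObjectDetector(ObjectDetector):
    def initialize(self, context):
        # TorchServe passes the assigned GPU index in the worker's
        # system properties; make it the current CUDA device so that
        # any allocation defaulting to "cuda" lands there, not on cuda:0.
        properties = context.system_properties
        gpu_id = properties.get("gpu_id")
        if torch.cuda.is_available() and gpu_id is not None:
            torch.cuda.set_device(int(gpu_id))
        super().initialize(context)
```

Such a handler file could then be passed to torch-model-archiver via its --handler option in place of the built-in object_detector handler.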
config.properties
default
Versions
Environment headers
Torchserve branch:
torchserve==0.6.0 torch-model-archiver==0.6.0
Python version: 3.8 (64-bit runtime)
Python executable: /opt/conda/bin/python
Versions of relevant python libraries:
captum==0.5.0
future==0.18.2
numpy==1.23.0
nvgpu==0.9.0
psutil==5.9.1
pytest==6.2.2
pytest-cov==2.11.1
pytest-pythonpath==0.7.3
pytorch-transformers==1.1.0
requests==2.28.0
sentencepiece==0.1.95
torch==1.9.0+cu111
torch-model-archiver==0.6.0
torch-workflow-archiver==0.2.4
torchaudio==0.9.0
torchserve==0.6.0
torchserve-dashboard==0.5.0
torchtext==0.10.0
torchvision==0.10.0+cu111
wheel==0.37.1
Java Version:
OS: N/A
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: N/A
CMake version: version 3.19.4
Is CUDA available: Yes
CUDA runtime version: 11.2.67
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 3090
GPU 1: NVIDIA GeForce RTX 3090
GPU 2: NVIDIA GeForce RTX 3090
GPU 3: NVIDIA GeForce RTX 3090
GPU 4: NVIDIA GeForce RTX 3090
GPU 5: NVIDIA GeForce RTX 3090
GPU 6: NVIDIA GeForce RTX 3090
GPU 7: NVIDIA GeForce RTX 3090
Nvidia driver version: 510.54
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.1.0
Repro instructions
Follow the steps at https://github.com/pytorch/serve/tree/master/examples/object_detector/fast-rcnn and run nvidia-smi in a separate terminal to watch per-GPU memory (a small Python polling sketch is included below).
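For convenience, here is a small Python sketch (not part of the original example) that polls per-GPU memory via nvidia-smi while the requests are sent, which makes it easy to see whether cuda:0 holds an extra copy of the weights:

```python
import subprocess
import time

def gpu_memory_mib():
    """Return {gpu_index: used_MiB} as reported by nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=index,memory.used",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    usage = {}
    for line in out.strip().splitlines():
        index, used = line.split(",")
        usage[int(index)] = int(used)
    return usage

if __name__ == "__main__":
    # Print usage every couple of seconds while requests are being served.
    while True:
        print(gpu_memory_mib())
        time.sleep(2)
```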
Possible Solution
No response
Top GitHub Comments
Closing this. Please re-open if the issue is not resolved.
@jonathan-ibex Thanks for checking. Will debug further and get back to you