Failing to load the pre-trained weights on multi-gpus - FasterRCNN example

🐛 Describe the bug

Hey, I have an issue when running the FasterRCNN example given here: the backbone appears to be loaded on cuda:0 while the model itself is distributed across multiple GPUs. I saw that this issue was reported before for other architectures: #1037, #1038, vision issue.

I believe this is a similar issue, but I'm not sure how to handle this case. I'd appreciate the help.
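
For readers hitting the same symptom, here is a minimal sketch of one common way to keep a checkpoint's tensors off a default device: passing `map_location` to `torch.load`. It is not taken from this issue or from the TorchServe handler, and whether it applies to the FasterRCNN example is an assumption. The helper names `resolve_device` and `load_state_dict_on` are hypothetical, as is the idea that each worker receives a `gpu_id`.

```python
from typing import Optional


def resolve_device(gpu_id: Optional[int], cuda_available: bool) -> str:
    """Return the device string for one worker (hypothetical helper).

    A worker that was assigned a GPU index gets "cuda:<index>";
    anything else falls back to "cpu".
    """
    if cuda_available and gpu_id is not None:
        return f"cuda:{gpu_id}"
    return "cpu"


def load_state_dict_on(checkpoint_path: str, device: str):
    """Load a checkpoint so that its tensors are mapped to `device`."""
    # Imported lazily so resolve_device() stays usable without torch installed.
    import torch

    # map_location remaps tensor storages at load time. Without it, tensors
    # saved from a CUDA device are restored onto the device they were saved from.
    return torch.load(checkpoint_path, map_location=device)
```

A worker assigned GPU 3 would call `load_state_dict_on(path, resolve_device(3, True))`, so the weights land on `cuda:3` rather than on the device they were saved from.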

Error logs

[image: screenshot of the error logs]

Installation instructions

Install torchserve from source: No
Running in docker: Yes, inside this image: nvcr.io/nvidia/pytorch:21.02-py3

I clone the serve repo, run the install dependencies script, and then pip install torchserve.

Model Packaging

I use the built-in handler: https://github.com/pytorch/serve/blob/master/ts/torch_handler/object_detector.py

config.properties

default

Versions


Environment headers

Torchserve branch:

torchserve==0.6.0
torch-model-archiver==0.6.0

Python version: 3.8 (64-bit runtime)
Python executable: /opt/conda/bin/python

Versions of relevant python libraries:
captum==0.5.0
future==0.18.2
numpy==1.23.0
nvgpu==0.9.0
psutil==5.9.1
pytest==6.2.2
pytest-cov==2.11.1
pytest-pythonpath==0.7.3
pytorch-transformers==1.1.0
requests==2.28.0
sentencepiece==0.1.95
torch==1.9.0+cu111
torch-model-archiver==0.6.0
torch-workflow-archiver==0.2.4
torchaudio==0.9.0
torchserve==0.6.0
torchserve-dashboard==0.5.0
torchtext==0.10.0
torchvision==0.10.0+cu111
wheel==0.37.1

Java Version:

OS: N/A
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: N/A
CMake version: version 3.19.4

Is CUDA available: Yes
CUDA runtime version: 11.2.67
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 3090
GPU 1: NVIDIA GeForce RTX 3090
GPU 2: NVIDIA GeForce RTX 3090
GPU 3: NVIDIA GeForce RTX 3090
GPU 4: NVIDIA GeForce RTX 3090
GPU 5: NVIDIA GeForce RTX 3090
GPU 6: NVIDIA GeForce RTX 3090
GPU 7: NVIDIA GeForce RTX 3090
Nvidia driver version: 510.54
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.1.0

Repro instructions

Follow the steps here: https://github.com/pytorch/serve/tree/master/examples/object_detector/fast-rcnn and, while the model is being served, run nvidia-smi in a different terminal.
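
The nvidia-smi check can also be scripted. The sketch below is only an illustration, not part of the original report: it parses the per-GPU memory figures that `nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits` prints, so a GPU with disproportionate usage (for example GPU 0) is easy to spot. The function names are hypothetical.

```python
import csv
import io
import subprocess
from typing import Dict

# Query per-GPU used memory as bare CSV (values in MiB, no header, no units).
QUERY = ["nvidia-smi", "--query-gpu=index,memory.used", "--format=csv,noheader,nounits"]


def parse_gpu_memory(csv_text: str) -> Dict[int, int]:
    """Map GPU index -> used memory in MiB from nvidia-smi CSV output."""
    usage: Dict[int, int] = {}
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        usage[int(row[0])] = int(row[1])
    return usage


def read_gpu_memory() -> Dict[int, int]:
    """Run nvidia-smi and return the parsed usage (requires an NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_memory(out)
```

Calling `read_gpu_memory()` before and after the workers start gives two snapshots whose difference shows which GPUs the workers actually allocated memory on.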

Possible Solution

No response

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 10

Top GitHub Comments

1 reaction
agunapal commented, Jul 28, 2022

Closing this. Please re-open if the issue is not resolved.

1 reaction
agunapal commented, Jul 7, 2022

@jonathan-ibex Thanks for checking. Will debug further and get back to you.
