Cannot get CUDA device count, GPU metrics will not be available on multi-GPU node
Description
I want to deploy Triton server via Azure Kubernetes Service. My target node is ND96asr v4, which is equipped with 8 A100 GPUs. When I run Triton server without loading any models, the following messages are displayed:
root@fastertransformer-7dd47c77bb-46gpb:/workspace# mpirun -n 1 --allow-run-as-root tritonserver --model-repository=/workspace
W0221 16:43:52.559411 1908 metrics.cc:274] Cannot get CUDA device count, GPU metrics will not be available
I0221 16:43:52.791832 1908 libtorch.cc:998] TRITONBACKEND_Initialize: pytorch
I0221 16:43:52.791877 1908 libtorch.cc:1008] Triton TRITONBACKEND API version: 1.4
(※ /workspace is an empty directory.) Among these messages,
Cannot get CUDA device count, GPU metrics will not be available
is a problem for loading models. I assume the problem is caused by the docker image, because the following is printed at startup:
=============================
== Triton Inference Server ==
=============================
NVIDIA Release 21.07 (build 24810355)
Copyright (c) 2018-2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
ERROR: No supported GPU(s) detected to run this container
So "ERROR: No supported GPU(s) detected to run this container" is obtained, even though I can execute nvidia-smi inside the container:
root@fastertransformer-749fc45c48-hdjhq:/workspace# nvidia-smi
Mon Feb 21 20:22:29 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02 Driver Version: 470.57.02 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... Off | 00000001:00:00.0 Off | 0 |
| N/A 41C P0 49W / 400W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM... Off | 00000002:00:00.0 Off | 0 |
| N/A 40C P0 54W / 400W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM... Off | 00000003:00:00.0 Off | 0 |
| N/A 40C P0 52W / 400W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM... Off | 00000004:00:00.0 Off | 0 |
| N/A 41C P0 53W / 400W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA A100-SXM... Off | 0000000B:00:00.0 Off | 0 |
| N/A 41C P0 57W / 400W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 5 NVIDIA A100-SXM... Off | 0000000C:00:00.0 Off | 0 |
| N/A 39C P0 50W / 400W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 6 NVIDIA A100-SXM... Off | 0000000D:00:00.0 Off | 0 |
| N/A 40C P0 50W / 400W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 7 NVIDIA A100-SXM... Off | 0000000E:00:00.0 Off | 0 |
| N/A 41C P0 53W / 400W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
How can I fix this? For comparison, I also tried deploying to a machine equipped with a single T4, and there the startup succeeds.
root@fastertransformer-cc8dbdf6-vbp44:/workspace# nvidia-smi
Tue Feb 22 01:39:26 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02 Driver Version: 470.57.02 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000001:00:00.0 Off | Off |
| N/A 32C P8 9W / 70W | 0MiB / 16127MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
root@fastertransformer-cc8dbdf6-vbp44:/workspace# mpirun -n 1 --allow-run-as-root tritonserver --model-repository=/workspace
I0221 16:40:48.387855 61 metrics.cc:290] Collecting metrics for GPU 0: Tesla T4
I0221 16:40:48.615749 61 libtorch.cc:998] TRITONBACKEND_Initialize: pytorch
Therefore, I assume my multi-GPU settings are wrong, but I do not know what is wrong…
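One way to narrow this down is to check what the node actually advertises to Kubernetes. The commands below are only a sketch: they assume the standard NVIDIA device plugin is installed, and the node-name placeholder is hypothetical and must be replaced with the real AKS node name.
# Show the node's allocatable resources; a GPU node should list nvidia.com/gpu
kubectl describe node <node-name> | grep -A 8 "Allocatable:"
# Check that the NVIDIA device plugin pod is running (namespace/name may vary by install)
kubectl get pods -A | grep -i nvidia-device-plugin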
Triton Information
- docker image: nvcr.io/nvidia/tritonserver:21.07-py3
- nvidia driver: 470.57.02
- CUDA: 11.4
- K8S: 1.22.4
- Node Image: AKSUbuntu-1804gen2containerd-2022.02.01
- Node Size: Standard_ND96asr_v4
To Reproduce
Run the Triton server image nvcr.io/nvidia/tritonserver:21.07-py3 on an ND96asr v4 node via AKS.
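For reference, a minimal pod manifest sketch that explicitly requests the GPUs is shown below. It is illustrative only: the pod name is hypothetical, and it assumes the NVIDIA device plugin exposes the GPUs as the nvidia.com/gpu resource (the usual setup on AKS GPU node pools).
# Illustrative only; apply with kubectl
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: triton-multigpu-test        # hypothetical name
spec:
  containers:
  - name: tritonserver
    image: nvcr.io/nvidia/tritonserver:21.07-py3
    # Same command as in the issue, passed to the image's entrypoint
    args: ["mpirun", "-n", "1", "--allow-run-as-root",
           "tritonserver", "--model-repository=/workspace"]
    resources:
      limits:
        nvidia.com/gpu: 8           # expose all 8 A100s to the container
EOF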
Expected behavior
As in the single-T4 case, Triton server should be able to collect GPU metrics:
root@fastertransformer-cc8dbdf6-vbp44:/workspace# mpirun -n 1 --allow-run-as-root tritonserver --model-repository=/workspace
I0221 16:40:48.387855 61 metrics.cc:290] Collecting metrics for GPU 0: Tesla T4
I0221 16:40:48.615749 61 libtorch.cc:998] TRITONBACKEND_Initialize: pytorch
I0221 16:40:48.615782 61 libtorch.cc:1008] Triton TRITONBACKEND API version: 1.4
I0221 16:40:48.615786 61 libtorch.cc:1014] 'pytorch' TRITONBACKEND API version: 1.4
...
(※ /workspace is an empty directory.)
Top GitHub Comments
Please do not reopen issues for new questions. Unless the original question needs follow-up, we ask that you open a new issue.
For nvidia-smi, you’ll want to check out their documentation and resources. Triton works fine on devices where it cannot retrieve GPU metrics. And I see FasterTransformer’s performance example uses an A100, though you can also check with them. The best way to see both is to run an inference request.
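If it helps, a quick way to confirm the server is serving and to inspect whatever metrics it does expose is to query Triton's HTTP endpoints. The commands below are a sketch and assume the default ports (8000 for HTTP/health, 8002 for metrics) with the server reachable on localhost.
# Readiness check; an HTTP 200 means the server is up and ready
curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/health/ready
# Dump GPU-related metrics, if any are being collected
curl -s localhost:8002/metrics | grep nv_gpu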
I’m closing this ticket. If you have any follow-up to the new questions or any additional questions, please open a new issue for those.
@dyastremsky Thank you for your reply. I’m sorry for overlooking the known issues; I will wait for the fix!