NVML_ERROR_NOT_SUPPORTED exception

🐛 Describe the bug

On some devices, NVML does not support certain monitoring queries (in the log below, the temperature query). Currently the resulting exception is not handled, so the startup phase fails.

Error logs

2022-07-04T12:33:15,023 [ERROR] Thread-20 org.pytorch.serve.metrics.MetricCollector - Traceback (most recent call last):
  File "ts/metrics/metric_collector.py", line 27, in <module>
    system_metrics.collect_all(sys.modules['ts.metrics.system_metrics'], arguments.gpu)
  File "/usr/local/lib/python3.6/dist-packages/ts/metrics/system_metrics.py", line 91, in collect_all
    value(num_of_gpu)
  File "/usr/local/lib/python3.6/dist-packages/ts/metrics/system_metrics.py", line 72, in gpu_utilization
    statuses = list_gpus.device_statuses()
  File "/usr/local/lib/python3.6/dist-packages/nvgpu/list_gpus.py", line 67, in device_statuses
    return [device_status(device_index) for device_index in range(device_count)]
  File "/usr/local/lib/python3.6/dist-packages/nvgpu/list_gpus.py", line 67, in <listcomp>
    return [device_status(device_index) for device_index in range(device_count)]
  File "/usr/local/lib/python3.6/dist-packages/nvgpu/list_gpus.py", line 26, in device_status
    temperature = nv.nvmlDeviceGetTemperature(handle, nv.NVML_TEMPERATURE_GPU)
  File "/usr/local/lib/python3.6/dist-packages/pynvml/nvml.py", line 1956, in nvmlDeviceGetTemperature
    _nvmlCheckReturn(ret)
  File "/usr/local/lib/python3.6/dist-packages/pynvml/nvml.py", line 765, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.nvml.NVMLError_NotSupported: Not Supported

Installation instructions

pytorch/torchserve:latest-gpu

Model Packaging

N/A

config.properties

No response

Versions

------------------------------------------------------------------------------------------
Environment headers
------------------------------------------------------------------------------------------
Torchserve branch: 

torchserve==0.6.0
torch-model-archiver==0.6.0

Python version: 3.6 (64-bit runtime)
Python executable: /usr/bin/python3

Versions of relevant python libraries:
future==0.18.2
numpy==1.19.5
nvgpu==0.9.0
psutil==5.9.1
requests==2.27.1
torch-model-archiver==0.6.0
torch-workflow-archiver==0.2.4
torchserve==0.6.0
wheel==0.30.0
**Warning: torch not present ..
**Warning: torchtext not present ..
**Warning: torchvision not present ..
**Warning: torchaudio not present ..

Java Version:


OS: N/A
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: N/A
CMake version: N/A

Repro instructions

run:

torchserve --start --foreground --model-store model-store/

Possible Solution

Deal with those exceptions.
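
One way this could look, as a minimal sketch rather than the project's actual fix: query each device on its own and skip only the devices that raise the not-supported error, so a single unsupported device does not abort metric collection for the rest. `NotSupportedError`, `safe_device_statuses`, and `fake_device_status` are hypothetical names used for illustration; in the real stack the exception would be pynvml's `NVMLError_NotSupported` and the per-device query would be nvgpu's `device_status`.

```python
import logging

logger = logging.getLogger(__name__)


class NotSupportedError(Exception):
    """Hypothetical stand-in for pynvml.nvml.NVMLError_NotSupported."""


def safe_device_statuses(device_count, query_status, not_supported=NotSupportedError):
    """Query every device separately and skip the ones that are not supported.

    Returns a dict keyed by device index so the surviving entries keep
    their original device ids.
    """
    statuses = {}
    for index in range(device_count):
        try:
            statuses[index] = query_status(index)
        except not_supported:
            logger.warning("GPU %d does not support this query; skipping it", index)
    return statuses


def fake_device_status(index):
    """Hypothetical query in which device 1 plays the unsupported device."""
    if index == 1:
        raise NotSupportedError("Not Supported")
    return {"utilization": 10 * index, "mem_used": 100 * index}


statuses = safe_device_statuses(3, fake_device_status)
print(sorted(statuses))  # [0, 2]
```

Keying the result by device index, instead of returning a filtered list, avoids mislabeling devices when the collector later attaches a `device_id` dimension to each metric.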

Top GitHub Comments

lromor commented, Aug 18, 2022

In case anyone runs into a similar issue and wants a quick fix, I patched the code with:

diff --git a/ts/metrics/system_metrics.py b/ts/metrics/system_metrics.py
index c7aaf6a..9915c9e 100644
--- a/ts/metrics/system_metrics.py
+++ b/ts/metrics/system_metrics.py
@@ -7,6 +7,7 @@ from builtins import str
 import psutil
 from ts.metrics.dimension import Dimension
 from ts.metrics.metric import Metric
+import pynvml
 
 system_metrics = []
 dimension = [Dimension('Level', 'Host')]
@@ -69,7 +70,11 @@ def gpu_utilization(num_of_gpu):
         system_metrics.append(Metric('GPUMemoryUtilization', value['mem_used_percent'], 'percent', dimension_gpu))
         system_metrics.append(Metric('GPUMemoryUsed', value['mem_used'], 'MB', dimension_gpu))
 
-    statuses = list_gpus.device_statuses()
+    try:
+        statuses = list_gpus.device_statuses()
+    except pynvml.nvml.NVMLError_NotSupported:
+        statuses = []
+
     for idx, status in enumerate(statuses):
         dimension_gpu = [Dimension('Level', 'Host'), Dimension("device_id", idx)]
         system_metrics.append(Metric('GPUUtilization', status['utilization'], 'percent', dimension_gpu))
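
Note that the patch above drops the status-based GPU metrics for every device as soon as one query is unsupported. A finer-grained alternative, sketched here under assumptions, is to guard each individual query so that an unsupported field (the temperature, in the traceback) becomes `None` while the other fields are still reported. `NotSupportedError`, `query_or_none`, and the `fake_*` functions are hypothetical; in practice the guard would need to sit where the NVML calls are made, which according to the traceback is inside nvgpu's `device_status`.

```python
class NotSupportedError(Exception):
    """Hypothetical stand-in for pynvml.nvml.NVMLError_NotSupported."""


def query_or_none(query, *args, not_supported=NotSupportedError):
    """Run a single query; return None instead of raising if it is unsupported."""
    try:
        return query(*args)
    except not_supported:
        return None


def fake_temperature(handle):
    """Hypothetical query that behaves like the failing temperature call."""
    raise NotSupportedError("Not Supported")


def fake_utilization(handle):
    """Hypothetical query that succeeds."""
    return 37


handle = object()  # placeholder for a device handle
status = {
    "temperature": query_or_none(fake_temperature, handle),
    "utilization": query_or_none(fake_utilization, handle),
}

# Emit metrics only for the fields that could actually be read.
available = {name: value for name, value in status.items() if value is not None}
print(available)  # {'utilization': 37}
```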

lromor commented, Jul 18, 2022

Hi @msaroufim, I’ve opened an issue here: https://forums.developer.nvidia.com/t/nvml-issue-with-virtual-a100/220718?u=lromor