NVML_ERROR_NOT_SUPPORTED exception
🐛 Describe the bug
NVML sometimes does not support monitoring queries for specific devices. When that happens, the metrics collector raises NVMLError_NotSupported, and TorchServe currently fails during the startup phase.
Error logs
2022-07-04T12:33:15,023 [ERROR] Thread-20 org.pytorch.serve.metrics.MetricCollector - Traceback (most recent call last):
  File "ts/metrics/metric_collector.py", line 27, in <module>
    system_metrics.collect_all(sys.modules['ts.metrics.system_metrics'], arguments.gpu)
  File "/usr/local/lib/python3.6/dist-packages/ts/metrics/system_metrics.py", line 91, in collect_all
    value(num_of_gpu)
  File "/usr/local/lib/python3.6/dist-packages/ts/metrics/system_metrics.py", line 72, in gpu_utilization
    statuses = list_gpus.device_statuses()
  File "/usr/local/lib/python3.6/dist-packages/nvgpu/list_gpus.py", line 67, in device_statuses
    return [device_status(device_index) for device_index in range(device_count)]
  File "/usr/local/lib/python3.6/dist-packages/nvgpu/list_gpus.py", line 67, in <listcomp>
    return [device_status(device_index) for device_index in range(device_count)]
  File "/usr/local/lib/python3.6/dist-packages/nvgpu/list_gpus.py", line 26, in device_status
    temperature = nv.nvmlDeviceGetTemperature(handle, nv.NVML_TEMPERATURE_GPU)
  File "/usr/local/lib/python3.6/dist-packages/pynvml/nvml.py", line 1956, in nvmlDeviceGetTemperature
    _nvmlCheckReturn(ret)
  File "/usr/local/lib/python3.6/dist-packages/pynvml/nvml.py", line 765, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.nvml.NVMLError_NotSupported: Not Supported
Installation instructions
pytorch/torchserve:latest-gpu
Model Packaging
N/A
config.properties
No response
Versions
------------------------------------------------------------------------------------------
Environment headers
------------------------------------------------------------------------------------------
Torchserve branch:
torchserve==0.6.0
torch-model-archiver==0.6.0
Python version: 3.6 (64-bit runtime)
Python executable: /usr/bin/python3
Versions of relevant python libraries:
future==0.18.2
numpy==1.19.5
nvgpu==0.9.0
psutil==5.9.1
requests==2.27.1
torch-model-archiver==0.6.0
torch-workflow-archiver==0.2.4
torchserve==0.6.0
wheel==0.30.0
**Warning: torch not present ..
**Warning: torchtext not present ..
**Warning: torchvision not present ..
**Warning: torchaudio not present ..
Java Version:
OS: N/A
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: N/A
CMake version: N/A
Repro instructions
run:
torchserve --start --foreground --model-store model-store/
Possible Solution
Handle these exceptions so that an unsupported NVML query does not abort startup.
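One way this could look is sketched below. It is only an illustration, not the fix that was applied in TorchServe or nvgpu: `collect_statuses` is a hypothetical helper that takes the per-device query and the tolerated exception types as arguments, logs a warning for any device whose query is not supported, and carries on with the remaining devices.

```python
import logging
from typing import Any, Callable, List, Tuple, Type

logger = logging.getLogger(__name__)


def collect_statuses(
    device_count: int,
    device_status: Callable[[int], Any],
    tolerated: Tuple[Type[BaseException], ...],
) -> List[Tuple[int, Any]]:
    """Return (index, status) pairs, skipping devices whose query raises a tolerated error.

    `collect_statuses` is a hypothetical helper written for this example;
    it is not part of TorchServe, nvgpu, or pynvml.
    """
    results = []
    for index in range(device_count):
        try:
            results.append((index, device_status(index)))
        except tolerated as exc:
            # The device cannot answer this query; log it and keep going
            # instead of letting the exception abort metric collection.
            logger.warning("Skipping GPU %d: metrics query failed (%s)", index, exc)
    return results


# Possible wiring (an assumption, based on the names in the traceback above):
# from nvgpu import list_gpus
# from pynvml.nvml import NVMLError_NotSupported
# statuses = collect_statuses(num_of_gpu, list_gpus.device_status, (NVMLError_NotSupported,))
```

Passing the exception types in explicitly keeps the helper independent of pynvml, and any error not listed in `tolerated` still propagates, so unrelated failures are not silently hidden.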
Issue Analytics
- State:
- Created a year ago
- Comments:8 (8 by maintainers)
Top GitHub Comments
In case anyone gets to a similar issue and would like to have a quick fix, I patched the code with:
Hi @msaroufim, I've opened an issue on the NVIDIA developer forums: https://forums.developer.nvidia.com/t/nvml-issue-with-virtual-a100/220718?u=lromor