GPU metrics are raising an exception
Despite being initialized when imported, the NVML startup-info and metrics methods seem to raise an exception about NVML not being initialized. (rapidsai/dask-cuda#122)
distributed.nanny - ERROR - Failed to start worker
Traceback (most recent call last):
  File "/home/nfs/bzaitlen/GitRepos/distributed/distributed/nanny.py", line 674, in run
    await worker
  File "/home/nfs/bzaitlen/GitRepos/distributed/distributed/worker.py", line 1016, in start
    await self._register_with_scheduler()
  File "/home/nfs/bzaitlen/GitRepos/distributed/distributed/worker.py", line 811, in _register_with_scheduler
    metrics=await self.get_metrics(),
  File "/home/nfs/bzaitlen/GitRepos/distributed/distributed/worker.py", line 740, in get_metrics
    result = await result
  File "/home/nfs/bzaitlen/miniconda3/envs/cudf-dev/lib/python3.7/site-packages/tornado/gen.py", line 742, in run
    yielded = self.gen.throw(*exc_info) # type: ignore
  File "/home/nfs/bzaitlen/GitRepos/distributed/distributed/worker.py", line 3406, in gpu_metric
    result = yield offload(nvml.real_time)
  File "/home/nfs/bzaitlen/miniconda3/envs/cudf-dev/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
    value = future.result()
  File "/home/nfs/bzaitlen/miniconda3/envs/cudf-dev/lib/python3.7/site-packages/tornado/gen.py", line 742, in run
    yielded = self.gen.throw(*exc_info) # type: ignore
  File "/home/nfs/bzaitlen/GitRepos/distributed/distributed/utils.py", line 1489, in offload
    return (yield _offload_executor.submit(fn, *args, **kwargs))
  File "/home/nfs/bzaitlen/miniconda3/envs/cudf-dev/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
    value = future.result()
  File "/home/nfs/bzaitlen/miniconda3/envs/cudf-dev/lib/python3.7/concurrent/futures/_base.py", line 425, in result
    return self.__get_result()
  File "/home/nfs/bzaitlen/miniconda3/envs/cudf-dev/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/home/nfs/bzaitlen/miniconda3/envs/cudf-dev/lib/python3.7/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/nfs/bzaitlen/GitRepos/distributed/distributed/diagnostics/nvml.py", line 11, in real_time
    "utilization": [pynvml.nvmlDeviceGetUtilizationRates(h).gpu for h in handles],
  File "/home/nfs/bzaitlen/GitRepos/distributed/distributed/diagnostics/nvml.py", line 11, in <listcomp>
    "utilization": [pynvml.nvmlDeviceGetUtilizationRates(h).gpu for h in handles],
  File "/home/nfs/bzaitlen/GitRepos/pynvml/pynvml/nvml.py", line 1347, in nvmlDeviceGetUtilizationRates
    check_return(ret)
  File "/home/nfs/bzaitlen/GitRepos/pynvml/pynvml/nvml.py", line 366, in check_return
    raise NVMLError(ret)
pynvml.nvml.NVMLError_Uninitialized: Uninitialized
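For context, the module at the bottom of the traceback (distributed/diagnostics/nvml.py) initializes NVML and caches device handles when it is imported, and the metrics call then only reads from those handles. A rough reconstruction from the frames above (not the exact source; the memory-used field is an assumption):

```python
# Rough reconstruction of distributed/diagnostics/nvml.py as the traceback
# shows it (not copied from the repo; the "memory-used" field is an assumption).
import pynvml

pynvml.nvmlInit()  # runs once, at import time
handles = [
    pynvml.nvmlDeviceGetHandleByIndex(i)
    for i in range(pynvml.nvmlDeviceGetCount())
]

def real_time():
    # Called later (offloaded to an executor) on each metrics poll.
    return {
        "utilization": [pynvml.nvmlDeviceGetUtilizationRates(h).gpu for h in handles],
        "memory-used": [pynvml.nvmlDeviceGetMemoryInfo(h).used for h in handles],
    }
```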
This exception will be safely passed over thanks to #2984 and #2991: startup methods and metric methods can be supplied by the user, and a bad method should not cause the worker to raise an exception.
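What those PRs describe amounts to catching exceptions from individual metric callables so a failing one is skipped rather than aborting worker startup. A minimal sketch of that idea (illustrative only, not the code from #2984/#2991; the names `collect_metrics`, `worker`, and `metrics` are made up here):

```python
import inspect

async def collect_metrics(worker, metrics):
    """Gather custom metrics, skipping any callable that raises."""
    results = {}
    for name, fn in metrics.items():
        try:
            value = fn(worker)
            if inspect.isawaitable(value):  # metric functions may be coroutines
                value = await value
            results[name] = value
        except Exception:
            # A bad user-supplied metric should not take the worker down.
            continue
    return results
```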
However, the underlying pynvml issue is still happening, so GPU metrics remain broken.
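One defensive pattern worth noting (a sketch, not the actual distributed code, and the last comment below suggests the real cause here may simply have been a broken environment) is to initialize NVML and fetch handles lazily inside the metrics function, so the call still works even if the import-time nvmlInit() never took effect in the process that ends up running it:

```python
import pynvml

_initialized = False
_handles = []

def _ensure_nvml():
    # Hypothetical helper: (re)initialize NVML on first use in this process.
    global _initialized, _handles
    if not _initialized:
        pynvml.nvmlInit()
        _handles = [
            pynvml.nvmlDeviceGetHandleByIndex(i)
            for i in range(pynvml.nvmlDeviceGetCount())
        ]
        _initialized = True

def real_time():
    _ensure_nvml()
    return {
        "utilization": [pynvml.nvmlDeviceGetUtilizationRates(h).gpu for h in _handles]
    }
```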
Issue Analytics
- State:
- Created 4 years ago
- Comments: 6 (6 by maintainers)
Top Results From Across the Web
GPU metrics error - Profiling Linux Targets
Incomplete events appear when GPU timestamp information has not been retrieved at the time the profiling session was stopped.

Monitor Amazon SageMaker with Amazon CloudWatch
You can monitor Amazon SageMaker using Amazon CloudWatch, which collects raw data and processes it into readable, near real-time metrics.

"RuntimeError: CUDA error: out of memory" - Stack Overflow
The error occurs because you ran out of memory on your GPU. One way to solve it is to reduce the batch size...

Troubleshooting | fastai
If you have nvidia-smi working and pytorch still can't recognize your NVIDIA GPU, most likely your system has more than one version of...

GpuInfo — PyTorch-Ignite v0.4.10 Documentation
In case gpu utilization reports “N/A” on a given GPU, the corresponding metric value is not set.
Top GitHub Comments
I am guessing this one can be closed now that #2993 has been merged?
Nevermind, my conda environment seems to have been in a weird state. I’m not sure exactly what caused it, but wiping and recreating it from scratch made this go away.