Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

GPU metrics are raising an exception

See original GitHub issue

Despite being initialized when imported, the nvml startup info and metrics methods seem to raise an exception about NVML not being initialized (rapidsai/dask-cuda#122):

distributed.nanny - ERROR - Failed to start worker
Traceback (most recent call last):
  File "/home/nfs/bzaitlen/GitRepos/distributed/distributed/nanny.py", line 674, in run
    await worker
  File "/home/nfs/bzaitlen/GitRepos/distributed/distributed/worker.py", line 1016, in start
    await self._register_with_scheduler()
  File "/home/nfs/bzaitlen/GitRepos/distributed/distributed/worker.py", line 811, in _register_with_scheduler
    metrics=await self.get_metrics(),
  File "/home/nfs/bzaitlen/GitRepos/distributed/distributed/worker.py", line 740, in get_metrics
    result = await result
  File "/home/nfs/bzaitlen/miniconda3/envs/cudf-dev/lib/python3.7/site-packages/tornado/gen.py", line 742, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/home/nfs/bzaitlen/GitRepos/distributed/distributed/worker.py", line 3406, in gpu_metric
    result = yield offload(nvml.real_time)
  File "/home/nfs/bzaitlen/miniconda3/envs/cudf-dev/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
    value = future.result()
  File "/home/nfs/bzaitlen/miniconda3/envs/cudf-dev/lib/python3.7/site-packages/tornado/gen.py", line 742, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/home/nfs/bzaitlen/GitRepos/distributed/distributed/utils.py", line 1489, in offload
    return (yield _offload_executor.submit(fn, *args, **kwargs))
  File "/home/nfs/bzaitlen/miniconda3/envs/cudf-dev/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
    value = future.result()
  File "/home/nfs/bzaitlen/miniconda3/envs/cudf-dev/lib/python3.7/concurrent/futures/_base.py", line 425, in result
    return self.__get_result()
  File "/home/nfs/bzaitlen/miniconda3/envs/cudf-dev/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/home/nfs/bzaitlen/miniconda3/envs/cudf-dev/lib/python3.7/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/nfs/bzaitlen/GitRepos/distributed/distributed/diagnostics/nvml.py", line 11, in real_time
    "utilization": [pynvml.nvmlDeviceGetUtilizationRates(h).gpu for h in handles],
  File "/home/nfs/bzaitlen/GitRepos/distributed/distributed/diagnostics/nvml.py", line 11, in <listcomp>
    "utilization": [pynvml.nvmlDeviceGetUtilizationRates(h).gpu for h in handles],
  File "/home/nfs/bzaitlen/GitRepos/pynvml/pynvml/nvml.py", line 1347, in nvmlDeviceGetUtilizationRates
    check_return(ret)
  File "/home/nfs/bzaitlen/GitRepos/pynvml/pynvml/nvml.py", line 366, in check_return
    raise NVMLError(ret)
pynvml.nvml.NVMLError_Uninitialized: Uninitialized
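
For context, here is a minimal sketch of the same failure mode outside of Dask (assuming pynvml is installed and at least one NVIDIA GPU is visible; this is an illustration, not code from distributed). Any NVML query raises NVMLError_Uninitialized until nvmlInit() has been called in the querying process:

import pynvml

# Before nvmlInit(), every NVML query fails with NVMLError_Uninitialized,
# the same error raised from distributed/diagnostics/nvml.py above.
try:
    pynvml.nvmlDeviceGetHandleByIndex(0)
except pynvml.NVMLError as err:
    print(err)  # -> Uninitialized

# Once NVML is initialized in this process, the same queries succeed.
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
util = pynvml.nvmlDeviceGetUtilizationRates(handle)
print(util.gpu, util.memory)
pynvml.nvmlShutdown()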


This exception will be safely passed over thanks to #2984 and #2991. The reasoning is that startup methods and metric methods can be supplied by the user, and a bad method should not cause the worker to raise an exception.
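
As an illustration of that defensive pattern (a sketch only; the helper names below are hypothetical, not the actual distributed API), a worker could wrap user-supplied metric callables so that one bad metric is dropped instead of propagating:

from typing import Callable, Dict, Optional

def safe_call(fn: Callable[[], object]) -> Optional[object]:
    # Run a user-supplied startup/metric callable, swallowing any exception
    # so a broken callable cannot take down the worker (hypothetical sketch).
    try:
        return fn()
    except Exception:
        return None

def collect_metrics(metrics: Dict[str, Callable[[], object]]) -> Dict[str, object]:
    # Keep only the metrics whose callables succeeded.
    results = {name: safe_call(fn) for name, fn in metrics.items()}
    return {name: value for name, value in results.items() if value is not None}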

However, this pynvml issue is still happening, so GPU metrics will remain broken.
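
One possible workaround sketch (assuming pynvml is available; this is not the fix that eventually landed in distributed) is to initialize NVML lazily inside the metric function itself, so the call works even in a freshly started worker process:

import pynvml

_nvml_initialized = False

def gpu_real_time():
    # Initialize NVML in this process on first use, then collect per-GPU
    # utilization and memory, mirroring the shape of nvml.real_time above.
    global _nvml_initialized
    if not _nvml_initialized:
        pynvml.nvmlInit()
        _nvml_initialized = True
    handles = [
        pynvml.nvmlDeviceGetHandleByIndex(i)
        for i in range(pynvml.nvmlDeviceGetCount())
    ]
    return {
        "utilization": [pynvml.nvmlDeviceGetUtilizationRates(h).gpu for h in handles],
        "memory-used": [pynvml.nvmlDeviceGetMemoryInfo(h).used for h in handles],
    }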

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

1 reaction
lesteve commented, Aug 28, 2019

I am guessing this one can be closed now that #2993 has been merged?

0 reactions
mivade commented, Jun 1, 2021

Nevermind, my conda environment seems to have been in a weird state. I’m not sure exactly what caused it, but wiping and recreating it from scratch made this go away.

Read more comments on GitHub >

Top Results From Across the Web

GPU metrics error - Profiling Linux Targets
Incomplete events appear when GPU timestamp information has not been retrieved by the time the profiling session was stopped.

Monitor Amazon SageMaker with Amazon CloudWatch
You can monitor Amazon SageMaker using Amazon CloudWatch, which collects raw data and processes it into readable, near real-time metrics.

"RuntimeError: CUDA error: out of memory" - Stack Overflow
The error occurs because you ran out of memory on your GPU. One way to solve it is to reduce the batch size...

Troubleshooting | fastai
If you have nvidia-smi working and pytorch still can't recognize your NVIDIA GPU, most likely your system has more than one version of...

GpuInfo — PyTorch-Ignite v0.4.10 Documentation
If GPU utilization reports "N/A" on a given GPU, the corresponding metric value is not set.
