GPU metrics are raising an exception
Despite being initialized when imported, the NVML startup-info and metrics methods seem to raise an exception about NVML not being initialized. (rapidsai/dask-cuda#122)
distributed.nanny - ERROR - Failed to start worker
Traceback (most recent call last):
  File "/home/nfs/bzaitlen/GitRepos/distributed/distributed/nanny.py", line 674, in run
    await worker
  File "/home/nfs/bzaitlen/GitRepos/distributed/distributed/worker.py", line 1016, in start
    await self._register_with_scheduler()
  File "/home/nfs/bzaitlen/GitRepos/distributed/distributed/worker.py", line 811, in _register_with_scheduler
    metrics=await self.get_metrics(),
  File "/home/nfs/bzaitlen/GitRepos/distributed/distributed/worker.py", line 740, in get_metrics
    result = await result
  File "/home/nfs/bzaitlen/miniconda3/envs/cudf-dev/lib/python3.7/site-packages/tornado/gen.py", line 742, in run
    yielded = self.gen.throw(*exc_info) # type: ignore
  File "/home/nfs/bzaitlen/GitRepos/distributed/distributed/worker.py", line 3406, in gpu_metric
    result = yield offload(nvml.real_time)
  File "/home/nfs/bzaitlen/miniconda3/envs/cudf-dev/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
    value = future.result()
  File "/home/nfs/bzaitlen/miniconda3/envs/cudf-dev/lib/python3.7/site-packages/tornado/gen.py", line 742, in run
    yielded = self.gen.throw(*exc_info) # type: ignore
  File "/home/nfs/bzaitlen/GitRepos/distributed/distributed/utils.py", line 1489, in offload
    return (yield _offload_executor.submit(fn, *args, **kwargs))
  File "/home/nfs/bzaitlen/miniconda3/envs/cudf-dev/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
    value = future.result()
  File "/home/nfs/bzaitlen/miniconda3/envs/cudf-dev/lib/python3.7/concurrent/futures/_base.py", line 425, in result
    return self.__get_result()
  File "/home/nfs/bzaitlen/miniconda3/envs/cudf-dev/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/home/nfs/bzaitlen/miniconda3/envs/cudf-dev/lib/python3.7/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/nfs/bzaitlen/GitRepos/distributed/distributed/diagnostics/nvml.py", line 11, in real_time
    "utilization": [pynvml.nvmlDeviceGetUtilizationRates(h).gpu for h in handles],
  File "/home/nfs/bzaitlen/GitRepos/distributed/distributed/diagnostics/nvml.py", line 11, in <listcomp>
    "utilization": [pynvml.nvmlDeviceGetUtilizationRates(h).gpu for h in handles],
  File "/home/nfs/bzaitlen/GitRepos/pynvml/pynvml/nvml.py", line 1347, in nvmlDeviceGetUtilizationRates
    check_return(ret)
  File "/home/nfs/bzaitlen/GitRepos/pynvml/pynvml/nvml.py", line 366, in check_return
    raise NVMLError(ret)
pynvml.nvml.NVMLError_Uninitialized: Uninitialized
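For context, the module at the bottom of the traceback (distributed/diagnostics/nvml.py) initializes NVML and caches device handles when it is imported, and the metrics call then only reads from those handles. A rough reconstruction from the frames above (not the exact source; the memory-used field is an assumption):

```python
# Rough reconstruction of distributed/diagnostics/nvml.py as the traceback
# shows it (not copied from the repo; the "memory-used" field is an assumption).
import pynvml

pynvml.nvmlInit()  # runs once, at import time
handles = [
    pynvml.nvmlDeviceGetHandleByIndex(i)
    for i in range(pynvml.nvmlDeviceGetCount())
]

def real_time():
    # Called later (offloaded to an executor) on each metrics poll.
    return {
        "utilization": [pynvml.nvmlDeviceGetUtilizationRates(h).gpu for h in handles],
        "memory-used": [pynvml.nvmlDeviceGetMemoryInfo(h).used for h in handles],
    }
```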
This exception will be safely passed over thanks to #2984 and #2991: startup methods and metric methods can be supplied by the user, and a bad method should not cause the worker to raise an exception.
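What those PRs describe amounts to catching exceptions from individual metric callables so a failing one is skipped rather than aborting worker startup. A minimal sketch of that idea (illustrative only, not the code from #2984/#2991; the names `collect_metrics`, `worker`, and `metrics` are made up here):

```python
import inspect

async def collect_metrics(worker, metrics):
    """Gather custom metrics, skipping any callable that raises."""
    results = {}
    for name, fn in metrics.items():
        try:
            value = fn(worker)
            if inspect.isawaitable(value):  # metric functions may be coroutines
                value = await value
            results[name] = value
        except Exception:
            # A bad user-supplied metric should not take the worker down.
            continue
    return results
```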
However, the underlying pynvml issue is still happening, so GPU metrics remain broken.
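One defensive pattern worth noting (a sketch, not the actual distributed code, and the last comment below suggests the real cause here may simply have been a broken environment) is to initialize NVML and fetch handles lazily inside the metrics function, so the call still works even if the import-time nvmlInit() never took effect in the process that ends up running it:

```python
import pynvml

_initialized = False
_handles = []

def _ensure_nvml():
    # Hypothetical helper: (re)initialize NVML on first use in this process.
    global _initialized, _handles
    if not _initialized:
        pynvml.nvmlInit()
        _handles = [
            pynvml.nvmlDeviceGetHandleByIndex(i)
            for i in range(pynvml.nvmlDeviceGetCount())
        ]
        _initialized = True

def real_time():
    _ensure_nvml()
    return {
        "utilization": [pynvml.nvmlDeviceGetUtilizationRates(h).gpu for h in _handles]
    }
```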
Issue Analytics
- State:
- Created 4 years ago
- Comments: 6 (6 by maintainers)
Top Results From Across the Web
GPU metrics error - Profiling Linux Targets
Incomplete events appear when GPU timestamp information has not been retrieved at the time the profiling session was stopped.

Monitor Amazon SageMaker with Amazon CloudWatch
You can monitor Amazon SageMaker using Amazon CloudWatch, which collects raw data and processes it into readable, near real-time metrics.

"RuntimeError: CUDA error: out of memory" - Stack Overflow
The error occurs because you ran out of memory on your GPU. One way to solve it is to reduce the batch size...

Troubleshooting | fastai
If you have nvidia-smi working and pytorch still can't recognize your NVIDIA GPU, most likely your system has more than one version of...

GpuInfo — PyTorch-Ignite v0.4.10 Documentation
In case gpu utilization reports “N/A” on a given GPU, the corresponding metric value is not set.
Top GitHub Comments
I am guessing this one can be closed now that #2993 has been merged?
Nevermind, my conda environment seems to have been in a weird state. I’m not sure exactly what caused it, but wiping and recreating it from scratch made this go away.