Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Simple local cluster on laptop fails with deactivated GPU

See original GitHub issue

What happened: ThinkPad T470p with dual GPUs, Intel HD Graphics 630 and NVIDIA GeForce 940MX (Optimus configuration). I use Bumblebee to switch the NVIDIA GPU on only when I need it for CUDA computation, to save power. The NVIDIA GPU is still visible but inactive when switched off this way.

When I run the code snippet below with the NVIDIA GPU deactivated, it crashes:

$ sudo echo OFF > /proc/acpi/bbswitch
$ ./test_distributed.py 
Traceback (most recent call last):
  File "./test_distributed.py", line 6, in <module>
    client = dask.distributed.Client()
  File "/home/sbauer/.local/lib/python3.7/site-packages/distributed/client.py", line 754, in __init__
    self.start(timeout=timeout)
  File "/home/sbauer/.local/lib/python3.7/site-packages/distributed/client.py", line 967, in start
    sync(self.loop, self._start, **kwargs)
  File "/home/sbauer/.local/lib/python3.7/site-packages/distributed/utils.py", line 354, in sync
    raise exc.with_traceback(tb)
  File "/home/sbauer/.local/lib/python3.7/site-packages/distributed/utils.py", line 337, in f
    result[0] = yield future
  File "/home/sbauer/.local/lib/python3.7/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/home/sbauer/.local/lib/python3.7/site-packages/distributed/client.py", line 1035, in _start
    **self._startup_kwargs,
  File "/home/sbauer/.local/lib/python3.7/site-packages/distributed/deploy/spec.py", line 401, in _
    await self._start()
  File "/home/sbauer/.local/lib/python3.7/site-packages/distributed/deploy/spec.py", line 311, in _start
    self.scheduler = cls(**self.scheduler_spec.get("options", {}))
  File "/home/sbauer/.local/lib/python3.7/site-packages/distributed/scheduler.py", line 3382, in __init__
    import distributed.dashboard.scheduler
  File "/home/sbauer/.local/lib/python3.7/site-packages/distributed/dashboard/scheduler.py", line 11, in <module>
    from .components.nvml import gpu_doc  # noqa: 1708
  File "/home/sbauer/.local/lib/python3.7/site-packages/distributed/dashboard/components/nvml.py", line 22, in <module>
    NVML_ENABLED = nvml.device_get_count() > 0
  File "/home/sbauer/.local/lib/python3.7/site-packages/distributed/diagnostics/nvml.py", line 31, in device_get_count
    return pynvml.nvmlDeviceGetCount()
  File "/home/sbauer/.local/lib/python3.7/site-packages/pynvml/nvml.py", line 1568, in nvmlDeviceGetCount
    _nvmlCheckReturn(ret)
  File "/home/sbauer/.local/lib/python3.7/site-packages/pynvml/nvml.py", line 743, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.nvml.NVMLError_Uninitialized: Uninitialized

When I activate the NVIDIA GPU, everything is fine:

$ sudo echo ON > /proc/acpi/bbswitch
$ ./test_distributed.py 
<Client: 'tcp://127.0.0.1:42147' processes=4 threads=8, memory=15.55 GiB>

Even explicitly setting CUDA_VISIBLE_DEVICES to an empty string doesn’t help (I use this environment variable in TensorFlow to make it ignore the GPU):

$ CUDA_VISIBLE_DEVICES='' ./test_distributed.py
Traceback (most recent call last):
  File "./test_distributed.py", line 6, in <module>
...
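
For illustration only (this snippet is not part of the original report), here is a minimal sketch that isolates the NVML side, under the assumption that the failure comes from NVML initialisation itself once bbswitch has powered the card down. NVML talks to the driver directly and does not consult CUDA_VISIBLE_DEVICES, which would explain why setting that variable makes no difference:

#!/usr/bin/env python3
# Hypothetical isolation script: checks whether NVML can be initialised
# at all, independently of dask/distributed.
import os
import pynvml

# NVML enumerates physical devices through the driver, so this setting is
# not expected to influence it (unlike CUDA itself).
os.environ["CUDA_VISIBLE_DEVICES"] = ""

try:
    pynvml.nvmlInit()
    print("NVML device count:", pynvml.nvmlDeviceGetCount())
    pynvml.nvmlShutdown()
except pynvml.NVMLError as e:
    # With the GPU switched off via bbswitch, this is presumably where the
    # "Uninitialized" error originates.
    print("NVML unavailable:", e)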

What you expected to happen: The local cluster starts regardless of whether the GPU is active or deactivated.

It would help if there were a way to tell distributed to ignore any GPU that is present, but I didn’t find anything in the documentation.

Minimal Complete Verifiable Example:

#!/usr/bin/env python3

import dask.distributed

if __name__ == "__main__":
    client = dask.distributed.Client()
    print(client)

Anything else we need to know?:

Environment:

  • Dask version: dask 2021.6.0 / distributed 2021.6.0
  • Python version: 3.7.3
  • Operating System: Linux 4.19.0-16-amd64 #1 SMP Debian 4.19.181-1 (2021-03-19) x86_64 GNU/Linux
  • Install method (conda, pip, source): pip

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 14 (9 by maintainers)

Top GitHub Comments

1 reaction
pentschev commented, Jun 8, 2021

@SteffenBauer would you mind testing if https://github.com/dask/distributed/pull/4893 fixes the issue for you? You should have two options:

  1. Edit the distributed.yaml file, setting distributed.diagnostics.nvml=False; or
  2. Set the environment variable DASK_DISTRIBUTED__DIAGNOSTICS__NVML=False (False is case-sensitive here; make sure you write it exactly like that). A sketch of both options follows below.
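
For reference, a minimal sketch (not from the original thread) of how option 1 could be applied programmatically, assuming the key added by dask/distributed#4893 is indeed distributed.diagnostics.nvml and is read at runtime; the environment variable from option 2 would be the equivalent for launching ./test_distributed.py unchanged:

#!/usr/bin/env python3
# Sketch only: assumes dask/distributed#4893 adds a runtime-checked
# "distributed.diagnostics.nvml" config key, as described above.
import dask
import dask.distributed

if __name__ == "__main__":
    # Disable NVML-based GPU diagnostics so the dashboard never calls pynvml.
    dask.config.set({"distributed.diagnostics.nvml": False})
    client = dask.distributed.Client()
    print(client)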

0 reactions
pentschev commented, Jun 9, 2021

Great! Thanks for testing and confirming again, @SteffenBauer!

Read more comments on GitHub >

Top Results From Across the Web

How to Fix a Disabled Graphics Card on a Laptop or PC - Alphr
This article discusses ways to fix your disabled graphics card from the BIOS and the Windows OS, depending on how you initially disabled...
Read more >
Running GPU passthrough for a virtual desktop with Hyper-V
When a virtual desktop needs to run graphically intensive workloads, admins may want to assign a GPU to those virtual desktops.
Read more >
How can I troubleshoot GPU issues in a Kubernetes cluster?
The error message indicates that the number of available GPUs is smaller than the actual number of GPUs on the nodes in a...
Read more >
Install Kubernetes — NVIDIA Cloud Native Technologies ...
Install NVIDIA Dependencies​​ The GPU worker nodes in the Kubernetes cluster need to be enabled with the following components: NVIDIA drivers. NVIDIA Container ......
Read more >
Jupyter on the HPC Clusters - Princeton Research Computing
Note that if all the GPUs are in use then you will have to wait. To check what is available, from the OnDemand...
Read more >
