Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Simple local cluster on laptop fails with deactivated GPU

See original GitHub issue

What happened: ThinkPad T470p with dual GPUs, Intel HD Graphics 630 and NVIDIA GeForce 940MX (Optimus configuration). I use Bumblebee to switch the NVIDIA GPU on only when I need it for CUDA computation, to save power. The NVIDIA GPU is still visible but inactive when switched off this way.

When I run the code snippet below with the NVIDIA GPU deactivated, it crashes:

$ sudo echo OFF > /proc/acpi/bbswitch
$ ./test_distributed.py 
Traceback (most recent call last):
  File "./test_distributed.py", line 6, in <module>
    client = dask.distributed.Client()
  File "/home/sbauer/.local/lib/python3.7/site-packages/distributed/client.py", line 754, in __init__
    self.start(timeout=timeout)
  File "/home/sbauer/.local/lib/python3.7/site-packages/distributed/client.py", line 967, in start
    sync(self.loop, self._start, **kwargs)
  File "/home/sbauer/.local/lib/python3.7/site-packages/distributed/utils.py", line 354, in sync
    raise exc.with_traceback(tb)
  File "/home/sbauer/.local/lib/python3.7/site-packages/distributed/utils.py", line 337, in f
    result[0] = yield future
  File "/home/sbauer/.local/lib/python3.7/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/home/sbauer/.local/lib/python3.7/site-packages/distributed/client.py", line 1035, in _start
    **self._startup_kwargs,
  File "/home/sbauer/.local/lib/python3.7/site-packages/distributed/deploy/spec.py", line 401, in _
    await self._start()
  File "/home/sbauer/.local/lib/python3.7/site-packages/distributed/deploy/spec.py", line 311, in _start
    self.scheduler = cls(**self.scheduler_spec.get("options", {}))
  File "/home/sbauer/.local/lib/python3.7/site-packages/distributed/scheduler.py", line 3382, in __init__
    import distributed.dashboard.scheduler
  File "/home/sbauer/.local/lib/python3.7/site-packages/distributed/dashboard/scheduler.py", line 11, in <module>
    from .components.nvml import gpu_doc  # noqa: 1708
  File "/home/sbauer/.local/lib/python3.7/site-packages/distributed/dashboard/components/nvml.py", line 22, in <module>
    NVML_ENABLED = nvml.device_get_count() > 0
  File "/home/sbauer/.local/lib/python3.7/site-packages/distributed/diagnostics/nvml.py", line 31, in device_get_count
    return pynvml.nvmlDeviceGetCount()
  File "/home/sbauer/.local/lib/python3.7/site-packages/pynvml/nvml.py", line 1568, in nvmlDeviceGetCount
    _nvmlCheckReturn(ret)
  File "/home/sbauer/.local/lib/python3.7/site-packages/pynvml/nvml.py", line 743, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.nvml.NVMLError_Uninitialized: Uninitialized

When I activate the NVIDIA GPU, everything is fine:

$ sudo echo ON > /proc/acpi/bbswitch
$ ./test_distributed.py 
<Client: 'tcp://127.0.0.1:42147' processes=4 threads=8, memory=15.55 GiB>

Even explicitly setting CUDA_VISIBLE_DEVICES to an empty string doesn’t help (I use this environment variable in TensorFlow to make it ignore the GPU):

$ CUDA_VISIBLE_DEVICES='' ./test_distributed.py
Traceback (most recent call last):
  File "./test_distributed.py", line 6, in <module>
...
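
For illustration only (this snippet is not part of the original report), here is a minimal sketch that isolates the NVML side, under the assumption that the failure comes from NVML initialisation itself once bbswitch has powered the card down. NVML talks to the driver directly and does not consult CUDA_VISIBLE_DEVICES, which would explain why setting that variable makes no difference:

#!/usr/bin/env python3
# Hypothetical isolation script: checks whether NVML can be initialised
# at all, independently of dask/distributed.
import os
import pynvml

# NVML enumerates physical devices through the driver, so this setting is
# not expected to influence it (unlike CUDA itself).
os.environ["CUDA_VISIBLE_DEVICES"] = ""

try:
    pynvml.nvmlInit()
    print("NVML device count:", pynvml.nvmlDeviceGetCount())
    pynvml.nvmlShutdown()
except pynvml.NVMLError as e:
    # With the GPU switched off via bbswitch, this is presumably where the
    # "Uninitialized" error originates.
    print("NVML unavailable:", e)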

What you expected to happen: The local cluster starts regardless of whether the GPU is active or deactivated.

It would help if there were a way to tell distributed to ignore any GPU that is present, but I didn’t find anything in the documentation.

Minimal Complete Verifiable Example:

#!/usr/bin/env python3

import dask.distributed

if __name__ == "__main__":
    client = dask.distributed.Client()
    print(client)

Anything else we need to know?:

Environment:

  • Dask version: dask 2021.6.0 / distributed 2021.6.0
  • Python version: 3.7.3
  • Operating System: Linux 4.19.0-16-amd64 #1 SMP Debian 4.19.181-1 (2021-03-19) x86_64 GNU/Linux
  • Install method (conda, pip, source): pip

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 14 (9 by maintainers)

Top GitHub Comments

1 reaction
pentschev commented, Jun 8, 2021

@SteffenBauer would you mind testing if https://github.com/dask/distributed/pull/4893 fixes the issue for you? You should have two options:

  1. Edit the distributed.yaml file, setting distributed.diagnostics.nvml=False; or
  2. Set the environment variable DASK_DISTRIBUTED__DIAGNOSTICS__NVML=False (False is case-sensitive here; make sure you write it exactly like that). A sketch of both options follows below.
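
For reference, a minimal sketch (not from the original thread) of how option 1 could be applied programmatically, assuming the key added by dask/distributed#4893 is indeed distributed.diagnostics.nvml and is read at runtime; the environment variable from option 2 would be the equivalent for launching ./test_distributed.py unchanged:

#!/usr/bin/env python3
# Sketch only: assumes dask/distributed#4893 adds a runtime-checked
# "distributed.diagnostics.nvml" config key, as described above.
import dask
import dask.distributed

if __name__ == "__main__":
    # Disable NVML-based GPU diagnostics so the dashboard never calls pynvml.
    dask.config.set({"distributed.diagnostics.nvml": False})
    client = dask.distributed.Client()
    print(client)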

0 reactions
pentschev commented, Jun 9, 2021

Great! Thanks for testing and confirming again, @SteffenBauer!

Read more comments on GitHub >

Top Results From Across the Web

How to Fix a Disabled Graphics Card on a Laptop or PC - Alphr
This article discusses ways to fix your disabled graphics card from the BIOS and the Windows OS, depending on how you initially disabled...
Read more >
Running GPU passthrough for a virtual desktop with Hyper-V
When a virtual desktop needs to run graphically intensive workloads, admins may want to assign a GPU to those virtual desktops.
Read more >
How can I troubleshoot GPU issues in a Kubernetes cluster?
The error message indicates that the number of available GPUs is smaller than the actual number of GPUs on the nodes in a...
Read more >
Install Kubernetes — NVIDIA Cloud Native Technologies ...
Install NVIDIA Dependencies​​ The GPU worker nodes in the Kubernetes cluster need to be enabled with the following components: NVIDIA drivers. NVIDIA Container ......
Read more >
Jupyter on the HPC Clusters - Princeton Research Computing
Note that if all the GPUs are in use then you will have to wait. To check what is available, from the OnDemand...
Read more >
