Simple local cluster on laptop fails with deactivated GPU
What happened: Thinkpad T470p with dual Intel HD Graphics 630 / Nvidia GeForce 940MX GPUs ('Optimus' configuration). I use bumblebee to switch the Nvidia GPU on/off only when I need it for CUDA computational work, to save power. The Nvidia GPU is still visible but inactive when switched off this way.
When I start the code snippet below with the Nvidia GPU deactivated, it crashes:
$ sudo echo OFF > /proc/acpi/bbswitch
$ ./test_distributed.py
Traceback (most recent call last):
  File "./test_distributed.py", line 6, in <module>
    client = dask.distributed.Client()
  File "/home/sbauer/.local/lib/python3.7/site-packages/distributed/client.py", line 754, in __init__
    self.start(timeout=timeout)
  File "/home/sbauer/.local/lib/python3.7/site-packages/distributed/client.py", line 967, in start
    sync(self.loop, self._start, **kwargs)
  File "/home/sbauer/.local/lib/python3.7/site-packages/distributed/utils.py", line 354, in sync
    raise exc.with_traceback(tb)
  File "/home/sbauer/.local/lib/python3.7/site-packages/distributed/utils.py", line 337, in f
    result[0] = yield future
  File "/home/sbauer/.local/lib/python3.7/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/home/sbauer/.local/lib/python3.7/site-packages/distributed/client.py", line 1035, in _start
    **self._startup_kwargs,
  File "/home/sbauer/.local/lib/python3.7/site-packages/distributed/deploy/spec.py", line 401, in _
    await self._start()
  File "/home/sbauer/.local/lib/python3.7/site-packages/distributed/deploy/spec.py", line 311, in _start
    self.scheduler = cls(**self.scheduler_spec.get("options", {}))
  File "/home/sbauer/.local/lib/python3.7/site-packages/distributed/scheduler.py", line 3382, in __init__
    import distributed.dashboard.scheduler
  File "/home/sbauer/.local/lib/python3.7/site-packages/distributed/dashboard/scheduler.py", line 11, in <module>
    from .components.nvml import gpu_doc # noqa: 1708
  File "/home/sbauer/.local/lib/python3.7/site-packages/distributed/dashboard/components/nvml.py", line 22, in <module>
    NVML_ENABLED = nvml.device_get_count() > 0
  File "/home/sbauer/.local/lib/python3.7/site-packages/distributed/diagnostics/nvml.py", line 31, in device_get_count
    return pynvml.nvmlDeviceGetCount()
  File "/home/sbauer/.local/lib/python3.7/site-packages/pynvml/nvml.py", line 1568, in nvmlDeviceGetCount
    _nvmlCheckReturn(ret)
  File "/home/sbauer/.local/lib/python3.7/site-packages/pynvml/nvml.py", line 743, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.nvml.NVMLError_Uninitialized: Uninitialized
When I activate the Nvidia GPU, all is fine:
$ sudo echo ON > /proc/acpi/bbswitch
$ ./test_distributed.py
<Client: 'tcp://127.0.0.1:42147' processes=4 threads=8, memory=15.55 GiB>
Even explicitly setting CUDA_VISIBLE_DEVICES to an empty string doesn't work (I use this environment variable in TensorFlow to make it ignore the GPU):
$ CUDA_VISIBLE_DEVICES='' ./test_distributed.py
Traceback (most recent call last):
File "./test_distributed.py", line 6, in <module>
...
What you expected to happen: The local cluster starts regardless of whether the GPU is active or deactivated.
It would help if there were a way to tell distributed to ignore any GPU that is present, but I didn't find anything in the documentation.
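For context, the traceback shows the crash comes from the dashboard's NVML probe (nvml.device_get_count), not from CUDA itself; NVML talks to the driver directly, which is why CUDA_VISIBLE_DEVICES has no effect here. The following standalone sketch (using pynvml directly, not code from distributed) performs the same kind of probe but catches NVMLError, so a powered-off GPU yields a count of 0 instead of a crash:

#!/usr/bin/env python3
# Standalone sketch, not distributed's code: probe NVML like the dashboard does,
# but catch NVMLError so a powered-off GPU gives 0 devices instead of a crash.
import pynvml

def safe_device_count():
    try:
        pynvml.nvmlInit()  # raises NVMLError when the driver is unreachable
        return pynvml.nvmlDeviceGetCount()
    except pynvml.NVMLError:
        # e.g. NVMLError_Uninitialized / NVMLError_DriverNotLoaded with bbswitch OFF
        return 0

if __name__ == "__main__":
    print(safe_device_count())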
Minimal Complete Verifiable Example:
#!/usr/bin/env python3
import dask.distributed

if __name__ == "__main__":
    client = dask.distributed.Client()
    print(client)
Anything else we need to know?:
Environment:
- Dask version: dask 2021.6.0 / distributed 2021.6.0
- Python version: 3.7.3
- Operating System: Linux 4.19.0-16-amd64 #1 SMP Debian 4.19.181-1 (2021-03-19) x86_64 GNU/Linux
- Install method (conda, pip, source): pip
Top GitHub Comments
@SteffenBauer would you mind testing if https://github.com/dask/distributed/pull/4893 fixes the issue for you? You should have two options:
- in your distributed.yaml file, setting distributed.diagnostics.nvml=False; or
- setting the environment variable DASK_DISTRIBUTED__DIAGNOSTICS__NVML=False (False is case-sensitive here, make sure you write it exactly like that).

Great! Thanks for testing and confirming again @SteffenBauer!
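For reference, the environment-variable route from the comment above can be combined with the reporter's test script like this (a sketch; the variable name and value are taken verbatim from the comment, output not shown):

$ DASK_DISTRIBUTED__DIAGNOSTICS__NVML=False ./test_distributed.py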