[bug] dask-cuda worker dies due to a race condition
The dask-cuda worker sometimes fails due to a race condition when creating its storage space:
(gdf) [pradghos@host dask_start]$ dask-cuda-worker "tcp://9.3.89.66:8786"
distributed.dashboard.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: pip install jupyter-server-proxy
distributed.nanny - INFO - Start Nanny at: 'tcp://9.3.89.135:39097'
distributed.nanny - INFO - Start Nanny at: 'tcp://9.3.89.135:35201'
distributed.nanny - INFO - Start Nanny at: 'tcp://9.3.89.135:38867'
distributed.nanny - INFO - Start Nanny at: 'tcp://9.3.89.135:32983'
distributed.dashboard.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: pip install jupyter-server-proxy
distributed.dashboard.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: pip install jupyter-server-proxy
distributed.dashboard.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: pip install jupyter-server-proxy
distributed.dashboard.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: pip install jupyter-server-proxy
Process Dask Worker process (from Nanny):
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
Traceback (most recent call last):
  File "/mnt/pai/home/pradghos/anaconda3/envs/gdf/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/mnt/pai/home/pradghos/anaconda3/envs/gdf/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/mnt/pai/home/pradghos/anaconda3/envs/gdf/lib/python3.6/site-packages/distributed/process.py", line 191, in _run
    target(*args, **kwargs)
  File "/mnt/pai/home/pradghos/anaconda3/envs/gdf/lib/python3.6/site-packages/distributed/nanny.py", line 674, in _run
    worker = Worker(**worker_kwargs)
  File "/mnt/pai/home/pradghos/anaconda3/envs/gdf/lib/python3.6/site-packages/distributed/worker.py", line 542, in __init__
    self.data = data[0](**data[1])
  File "/mnt/pai/home/pradghos/anaconda3/envs/gdf/lib/python3.6/site-packages/dask_cuda/device_host_file.py", line 124, in __init__
    self.disk_func = Func(serialize_bytelist, deserialize_bytes, File(path))
  File "/mnt/pai/home/pradghos/anaconda3/envs/gdf/lib/python3.6/site-packages/zict/file.py", line 63, in __init__
    os.mkdir(self.directory)
FileExistsError: [Errno 17] File exists: 'storage'
distributed.nanny - INFO - Worker process 19311 exited with status 1
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOMainLoop object at 0x7ffd7728a198>>, <Task finished coro=<Nanny._on_exit() done, defined at /mnt/pai/home/pradghos/anaconda3/envs/gdf/lib/python3.6/site-packages/distributed/nanny.py:396> exception=TypeError('addresses should be strings or tuples, got None',)>)
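The traceback points at the race: each nanny-spawned worker builds a DeviceHostFile whose disk tier is a zict File aimed at the same local 'storage' path, and zict's File.__init__ calls os.mkdir unconditionally, so only one of the concurrently starting workers can create the directory and the others die with FileExistsError. The snippet below is an illustrative stand-alone sketch of that failure mode (names and structure are mine, not dask-cuda code): several processes trying to os.mkdir the same path.

# Illustrative repro sketch: processes racing to create the same "storage"
# directory the way zict.file.File.__init__ does with a bare os.mkdir.
import multiprocessing
import os

def create_storage(path="storage"):
    try:
        os.mkdir(path)  # what zict's File.__init__ does in the traceback above
        print(f"pid {os.getpid()}: created {path!r}")
    except FileExistsError:
        # Only one process can create the directory; every other process lands
        # here, which is the error that kills the dask-cuda worker above.
        print(f"pid {os.getpid()}: FileExistsError: [Errno 17] File exists: {path!r}")

if __name__ == "__main__":
    procs = [multiprocessing.Process(target=create_storage) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()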
However, the other three workers started and are serving requests. Because of this issue, we may observe fewer workers running than requested.
For example, below the user started 8 workers, but one worker died because of the race condition and the rest are serving requests:
Client
Scheduler: tcp://9.3.89.66:8786
Dashboard: http://9.3.89.66:8787/status
Cluster
Workers: 7
Cores: 7
Memory: 350.00 GB
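A quick way to confirm this symptom from the client side is to compare the number of connected workers with the number requested. This is a sketch using the scheduler address shown above (any running scheduler address works):

from dask.distributed import Client

client = Client("tcp://9.3.89.66:8786")  # scheduler address from the report above
workers = client.scheduler_info()["workers"]
print(len(workers))  # prints 7 here, even though 8 workers were requested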
Issue Analytics
- State:
- Created 4 years ago
- Comments: 11 (8 by maintainers)
Top Results From Across the Web
- Why did my worker die? - Dask.distributed: KilledWorker: this means that a particular task was tried on a worker, and it died, and then the same task was sent...
- Changelog — Dask.distributed 2022.12.1 documentation: Respect death timeout when waiting for scheduler file (GH#7296) Florian Jetter ... Harden preamble of Worker.execute against race conditions (GH#6878) ...
- Source code for distributed.worker: Dask.distributed 2022.12.1 documentation ... Why did my worker die? Additional Features. Actors · Asynchronous Operation · HTTP endpoints · Publish Datasets ...
- Futures - Dask documentation: If that worker becomes overburdened or dies, then there is no opportunity to recover the workload. Because Actors avoid the central scheduler they...
- API — Dask.distributed 2022.12.1 documentation: Registers a lifecycle worker plugin for all current and future workers. Client.replicate (futures[, n, workers, ...]) Set replication of futures ...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
@pradghos your zict PR has been merged, thanks for the work there! I think it should solve the issue here, so I’ll tentatively close it here, but feel free to reopen should you encounter any related issues.
Ok, you’re replying from email and I’m doing that from the GH interface; I didn’t see those because they were edited out. @pradghos I agree with @mrocklin that you had good suggestions for a fix, not sure why you edited them out. 😃
If you’re up to filing a PR to resolve the issue @pradghos, I suggest you then use your preferred method. If you can’t file a PR, then I’ll do that tomorrow or Monday.
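For reference, the usual shape of a race-safe fix on the zict side is to tolerate concurrent creation of the storage directory. This is a sketch of the general pattern only; the merged PR may differ in its details:

import os

def ensure_storage_directory(directory):
    # Tolerate sibling worker processes creating the same directory at the
    # same time instead of dying with FileExistsError.
    os.makedirs(directory, exist_ok=True)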