
[bug] dask-cuda worker died due to a race condition

See original GitHub issue

The dask-cuda worker sometimes fails due to a race condition while creating its storage directory:

(gdf) [pradghos@host  dask_start]$ dask-cuda-worker "tcp://9.3.89.66:8786"
distributed.dashboard.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: pip install jupyter-server-proxy
distributed.nanny - INFO -         Start Nanny at: 'tcp://9.3.89.135:39097'
distributed.nanny - INFO -         Start Nanny at: 'tcp://9.3.89.135:35201'
distributed.nanny - INFO -         Start Nanny at: 'tcp://9.3.89.135:38867'
distributed.nanny - INFO -         Start Nanny at: 'tcp://9.3.89.135:32983'
distributed.dashboard.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: pip install jupyter-server-proxy
distributed.dashboard.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: pip install jupyter-server-proxy
distributed.dashboard.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: pip install jupyter-server-proxy
distributed.dashboard.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: pip install jupyter-server-proxy
Process Dask Worker process (from Nanny):
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
Traceback (most recent call last):
  File "/mnt/pai/home/pradghos/anaconda3/envs/gdf/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/mnt/pai/home/pradghos/anaconda3/envs/gdf/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/mnt/pai/home/pradghos/anaconda3/envs/gdf/lib/python3.6/site-packages/distributed/process.py", line 191, in _run
    target(*args, **kwargs)
  File "/mnt/pai/home/pradghos/anaconda3/envs/gdf/lib/python3.6/site-packages/distributed/nanny.py", line 674, in _run
    worker = Worker(**worker_kwargs)
  File "/mnt/pai/home/pradghos/anaconda3/envs/gdf/lib/python3.6/site-packages/distributed/worker.py", line 542, in __init__
    self.data = data[0](**data[1])
  File "/mnt/pai/home/pradghos/anaconda3/envs/gdf/lib/python3.6/site-packages/dask_cuda/device_host_file.py", line 124, in __init__
    self.disk_func = Func(serialize_bytelist, deserialize_bytes, File(path))
  File "/mnt/pai/home/pradghos/anaconda3/envs/gdf/lib/python3.6/site-packages/zict/file.py", line 63, in __init__
    os.mkdir(self.directory)
FileExistsError: [Errno 17] File exists: 'storage'
distributed.nanny - INFO - Worker process 19311 exited with status 1
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOMainLoop object at 0x7ffd7728a198>>, <Task finished coro=<Nanny._on_exit() done, defined at /mnt/pai/home/pradghos/anaconda3/envs/gdf/lib/python3.6/site-packages/distributed/nanny.py:396> exception=TypeError('addresses should be strings or tuples, got None',)>)
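
The failing call is zict.file.File.__init__, which uses a bare os.mkdir on the worker's storage path. Below is a minimal, hypothetical sketch of the race (the process names and the standalone script are illustrative, not taken from dask-cuda): when several nanny-spawned worker processes share the same local directory, only the first os.mkdir succeeds and the others hit the same FileExistsError shown in the traceback.

import multiprocessing
import os
import shutil

STORAGE_DIR = "storage"  # every worker process targets the same relative path

def create_storage(name):
    # zict's File.__init__ effectively does os.mkdir(directory); the call is
    # not idempotent, so every process that loses the race raises.
    try:
        os.mkdir(STORAGE_DIR)
        print(f"{name}: created {STORAGE_DIR!r}")
    except FileExistsError:
        print(f"{name}: FileExistsError, another worker created it first")

if __name__ == "__main__":
    shutil.rmtree(STORAGE_DIR, ignore_errors=True)
    procs = [multiprocessing.Process(target=create_storage, args=(f"worker-{i}",))
             for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

Here the error is caught so the demo keeps running; in the real worker it propagates out of Worker.__init__ and kills the process.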

However, the other three workers started and are serving requests. Because of this issue, we can end up with fewer workers running than were requested!

Example: below, the user started 8 workers, but one worker died because of the race condition and the remaining 7 are serving requests (a sketch for checking the live worker count follows the cluster summary):

Client
    Scheduler: tcp://9.3.89.66:8786
    Dashboard: http://9.3.89.66:8787/status 
Cluster
    Workers: 7
    Cores: 7
    Memory: 350.00 GB
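
A quick way to confirm the shortfall from a client session (a hedged sketch; the scheduler address is taken from the report above and the expected count of 8 is this user's setup):

from dask.distributed import Client

client = Client("tcp://9.3.89.66:8786")

# scheduler_info() describes the cluster; its "workers" entry maps each
# connected worker address to metadata, so its length is the live worker count.
n_workers = len(client.scheduler_info()["workers"])
print(f"workers connected: {n_workers}")  # prints 7 here, although 8 were requested

# Fail fast if the expected number of workers never joins.
client.wait_for_workers(n_workers=8, timeout=60)

wait_for_workers raises if the cluster never reaches 8 workers within the timeout, which surfaces a silently dead worker much earlier than a slow computation would.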

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 11 (8 by maintainers)

Top GitHub Comments

1 reaction
pentschev commented on Feb 24, 2020

@pradghos your zict PR has been merged, thanks for the work there! I think it should solve the issue here, so I’ll tentatively close it here, but feel free to reopen should you encounter any related issues.
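
For reference, the shape of fix being discussed is making the directory creation idempotent; a hedged sketch (not the actual merged zict diff) of what that looks like:

import os

class File:
    # Simplified stand-in for zict.file.File, for illustration only.
    def __init__(self, directory):
        self.directory = directory
        # os.makedirs(..., exist_ok=True) succeeds even when another worker
        # process has already created the directory, so the race no longer
        # raises FileExistsError the way a bare os.mkdir does.
        os.makedirs(self.directory, exist_ok=True)

With a change like that, concurrently started dask-cuda workers can share the same local directory without one of them dying at startup.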

1 reaction
pentschev commented on Jan 30, 2020

Ok, you’re replying from email and I’m doing that from the GH interface; I didn’t see those because they were edited out. @pradghos I agree with @mrocklin that you had good suggestions for a fix, not sure why you edited them out. 😃

If you’re up to filing a PR to resolve the issue, @pradghos, I suggest you use your preferred method. If you can’t file a PR, then I’ll do it tomorrow or Monday.

Read more comments on GitHub >

Top Results From Across the Web

Why did my worker die? - Dask.distributed
KilledWorker: this means that a particular task was tried on a worker, and it died, and then the same task was sent...

Changelog — Dask.distributed 2022.12.1 documentation
Respect death timeout when waiting for scheduler file (GH#7296) Florian Jetter ... Harden preamble of Worker.execute against race conditions (GH#6878) ...

Source code for distributed.worker
Dask.distributed 2022.12.1 documentation ... Why did my worker die? Additional Features. Actors · Asynchronous Operation · HTTP endpoints · Publish Datasets ...

Futures - Dask documentation
If that worker becomes overburdened or dies, then there is no opportunity to recover the workload. Because Actors avoid the central scheduler they...

API — Dask.distributed 2022.12.1 documentation
Registers a lifecycle worker plugin for all current and future workers. Client.replicate (futures[, n, workers, ...]) Set replication of futures ...
