[bug] dask-cuda worker dies due to a race condition
The dask-cuda worker sometimes fails due to a race condition when creating its storage space:
(gdf) [pradghos@host dask_start]$ dask-cuda-worker "tcp://9.3.89.66:8786"
distributed.dashboard.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: pip install jupyter-server-proxy
distributed.nanny - INFO - Start Nanny at: 'tcp://9.3.89.135:39097'
distributed.nanny - INFO - Start Nanny at: 'tcp://9.3.89.135:35201'
distributed.nanny - INFO - Start Nanny at: 'tcp://9.3.89.135:38867'
distributed.nanny - INFO - Start Nanny at: 'tcp://9.3.89.135:32983'
distributed.dashboard.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: pip install jupyter-server-proxy
distributed.dashboard.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: pip install jupyter-server-proxy
distributed.dashboard.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: pip install jupyter-server-proxy
distributed.dashboard.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: pip install jupyter-server-proxy
Process Dask Worker process (from Nanny):
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
Traceback (most recent call last):
  File "/mnt/pai/home/pradghos/anaconda3/envs/gdf/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/mnt/pai/home/pradghos/anaconda3/envs/gdf/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/mnt/pai/home/pradghos/anaconda3/envs/gdf/lib/python3.6/site-packages/distributed/process.py", line 191, in _run
    target(*args, **kwargs)
  File "/mnt/pai/home/pradghos/anaconda3/envs/gdf/lib/python3.6/site-packages/distributed/nanny.py", line 674, in _run
    worker = Worker(**worker_kwargs)
  File "/mnt/pai/home/pradghos/anaconda3/envs/gdf/lib/python3.6/site-packages/distributed/worker.py", line 542, in __init__
    self.data = data[0](**data[1])
  File "/mnt/pai/home/pradghos/anaconda3/envs/gdf/lib/python3.6/site-packages/dask_cuda/device_host_file.py", line 124, in __init__
    self.disk_func = Func(serialize_bytelist, deserialize_bytes, File(path))
  File "/mnt/pai/home/pradghos/anaconda3/envs/gdf/lib/python3.6/site-packages/zict/file.py", line 63, in __init__
    os.mkdir(self.directory)
FileExistsError: [Errno 17] File exists: 'storage'
distributed.nanny - INFO - Worker process 19311 exited with status 1
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOMainLoop object at 0x7ffd7728a198>>, <Task finished coro=<Nanny._on_exit() done, defined at /mnt/pai/home/pradghos/anaconda3/envs/gdf/lib/python3.6/site-packages/distributed/nanny.py:396> exception=TypeError('addresses should be strings or tuples, got None',)>)
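The traceback points at the race: each nanny-spawned worker builds a DeviceHostFile whose disk tier is a zict File aimed at the same local 'storage' path, and zict's File.__init__ calls os.mkdir unconditionally, so only one of the concurrently starting workers can create the directory and the others die with FileExistsError. The snippet below is an illustrative stand-alone sketch of that failure mode (names and structure are mine, not dask-cuda code): several processes trying to os.mkdir the same path.

# Illustrative repro sketch: processes racing to create the same "storage"
# directory the way zict.file.File.__init__ does with a bare os.mkdir.
import multiprocessing
import os

def create_storage(path="storage"):
    try:
        os.mkdir(path)  # what zict's File.__init__ does in the traceback above
        print(f"pid {os.getpid()}: created {path!r}")
    except FileExistsError:
        # Only one process can create the directory; every other process lands
        # here, which is the error that kills the dask-cuda worker above.
        print(f"pid {os.getpid()}: FileExistsError: [Errno 17] File exists: {path!r}")

if __name__ == "__main__":
    procs = [multiprocessing.Process(target=create_storage) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()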
However, the other three workers started and are serving requests. Because of this issue, we may observe fewer workers running than requested.
For example, below the user started 8 workers, but one worker died because of the race condition and the rest are serving requests:
Client
Scheduler: tcp://9.3.89.66:8786
Dashboard: http://9.3.89.66:8787/status
Cluster
Workers: 7
Cores: 7
Memory: 350.00 GB
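A quick way to confirm this symptom from the client side is to compare the number of connected workers with the number requested. This is a sketch using the scheduler address shown above (any running scheduler address works):

from dask.distributed import Client

client = Client("tcp://9.3.89.66:8786")  # scheduler address from the report above
workers = client.scheduler_info()["workers"]
print(len(workers))  # prints 7 here, even though 8 workers were requested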
Issue Analytics
- State:
- Created 4 years ago
- Comments: 11 (8 by maintainers)
Top Results From Across the Web
- Why did my worker die? - Dask.distributed: KilledWorker: this means that a particular task was tried on a worker, and it died, and then the same task was sent...
- Changelog — Dask.distributed 2022.12.1 documentation: Respect death timeout when waiting for scheduler file (GH#7296) Florian Jetter ... Harden preamble of Worker.execute against race conditions (GH#6878) ...
- Source code for distributed.worker: Dask.distributed 2022.12.1 documentation ... Why did my worker die? Additional Features. Actors · Asynchronous Operation · HTTP endpoints · Publish Datasets ...
- Futures - Dask documentation: If that worker becomes overburdened or dies, then there is no opportunity to recover the workload. Because Actors avoid the central scheduler they...
- API — Dask.distributed 2022.12.1 documentation: Registers a lifecycle worker plugin for all current and future workers. Client.replicate (futures[, n, workers, ...]) Set replication of futures ...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
@pradghos your zict PR has been merged, thanks for the work there! I think it should solve the issue here, so I’ll tentatively close it here, but feel free to reopen should you encounter any related issues.
Ok, you’re replying from email and I’m doing that from the GH interface; I didn’t see those because they were edited out. @pradghos I agree with @mrocklin that you had good suggestions for a fix, not sure why you edited them out. 😃
If you’re up to filing a PR to resolve the issue @pradghos, I suggest you then use your preferred method. If you can’t file a PR, then I’ll do that tomorrow or Monday.
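For reference, the usual shape of a race-safe fix on the zict side is to tolerate concurrent creation of the storage directory. This is a sketch of the general pattern only; the merged PR may differ in its details:

import os

def ensure_storage_directory(directory):
    # Tolerate sibling worker processes creating the same directory at the
    # same time instead of dying with FileExistsError.
    os.makedirs(directory, exist_ok=True)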