
dask-worker fail to inform scheduler of closing with UCX


dask-workers fail to inform the scheduler of their closing when the protocol used is UCX. I spent quite some time trying to debug this, but I got really confused by the async state in distributed. What I figured out is that the TCP stream in Tornado will hit

https://github.com/dask/distributed/blob/806a7e97285c5534b3e37e912cbc060a8036c56f/distributed/comm/tcp.py#L205

which will raise an exception and break out of

https://github.com/dask/distributed/blob/806a7e97285c5534b3e37e912cbc060a8036c56f/distributed/core.py#L456

and cause the scheduler to close its end in

https://github.com/dask/distributed/blob/806a7e97285c5534b3e37e912cbc060a8036c56f/distributed/core.py#L491

However, UCX isn’t a stream and will not raise such an exception, causing the scheduler to remain awaiting

https://github.com/dask/distributed/blob/806a7e97285c5534b3e37e912cbc060a8036c56f/distributed/core.py#L456
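To make the flow above concrete, here is a minimal, self-contained sketch of that handler-loop pattern (toy classes and names, not distributed's actual code): a stream-like comm raises as soon as the peer disconnects, so the loop breaks and the server closes its end, while a comm that never surfaces peer closure leaves the handler parked on the read forever.

import asyncio


class CommClosedError(Exception):
    """Stand-in for the exception a stream-based transport raises on disconnect."""


class FakeTcpComm:
    async def read(self):
        await asyncio.sleep(0.1)            # peer disconnects...
        raise CommClosedError("peer hung up")

    async def close(self):
        print("server closed its end of the comm")


class FakeUcxComm(FakeTcpComm):
    async def read(self):
        await asyncio.Future()              # peer closure never surfaces; waits forever


async def handle_comm(comm):
    try:
        while True:
            msg = await comm.read()         # the await at core.py#L456
            print("dispatching", msg)
    except CommClosedError:
        pass
    finally:
        await comm.close()                  # the cleanup at core.py#L491


async def main():
    # TCP-like comm: read() raises once the peer disconnects, so cleanup runs.
    await handle_comm(FakeTcpComm())
    # UCX-like comm: the following line would hang forever, because read()
    # never raises and the handler never reaches comm.close():
    #     await handle_comm(FakeUcxComm())


asyncio.run(main())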

The issue above should be taken care of by

https://github.com/dask/distributed/blob/806a7e97285c5534b3e37e912cbc060a8036c56f/distributed/nanny.py#L674-L699

But as far as I could tell, the process exits before do_stop gets executed, which is where I think the bug is, though I wasn't able to figure out which of the async tasks causes the process to exit prematurely.
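To illustrate the kind of race suspected here, below is a small self-contained asyncio sketch (made-up names, not the actual nanny.py code): a stop coroutine is scheduled on the event loop, but the loop is torn down before the task ever runs, so the "I am closing" notification never goes out.

import asyncio


async def do_stop():
    # Stand-in for "notify the scheduler that this worker is closing".
    print("scheduler notified of close")


def shutdown_handler(loop):
    # Fire-and-forget: the stop coroutine is scheduled, but the loop is
    # stopped in the same callback, so do_stop() never actually runs.
    asyncio.ensure_future(do_stop())
    loop.stop()


loop = asyncio.new_event_loop()
loop.call_soon(shutdown_handler, loop)
loop.run_forever()      # exits without ever printing "scheduler notified of close"
                        # (asyncio may also warn that the pending do_stop task was destroyed)

# In this toy setting the fix is to make sure the notification is awaited
# before the process is allowed to exit, e.g.:
#     loop.run_until_complete(do_stop())
loop.close()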

To reproduce this, start a scheduler and a worker as below and, once the worker is connected, press CTRL+C on the worker; the scheduler will not report that the worker has closed.

dask-scheduler --protocol ucx
dask-worker ucx://SERVER_IP:8786
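
For comparison, the same steps over the default TCP transport (the working path described above) do make the scheduler notice and report that the worker went away:

dask-scheduler --protocol tcp
dask-worker tcp://SERVER_IP:8786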

@jacobtomlinson @pitrou, I'd appreciate it if you could shed some light on this.

cc @quasiben

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 9 (9 by maintainers)

Top GitHub Comments

1 reaction
quasiben commented, Apr 30, 2020

closed by #3747

0 reactions
jakirkham commented, Apr 29, 2020

This should be solved by PR ( https://github.com/dask/distributed/pull/3747 ).

