dask-worker fails to inform scheduler of closing with UCX

dask-workers fail to inform the scheduler that they are closing when the protocol used is UCX. I spent quite some time trying to debug this, but I got really confused by the async state in distributed. What I figured out is that, with TCP, the Tornado stream will hit

https://github.com/dask/distributed/blob/806a7e97285c5534b3e37e912cbc060a8036c56f/distributed/comm/tcp.py#L205

which will raise an exception and break out of

https://github.com/dask/distributed/blob/806a7e97285c5534b3e37e912cbc060a8036c56f/distributed/core.py#L456

and cause the scheduler to close its end in

https://github.com/dask/distributed/blob/806a7e97285c5534b3e37e912cbc060a8036c56f/distributed/core.py#L491

However, UCX isn’t a stream and will not raise such an exception, so the scheduler keeps waiting at

https://github.com/dask/distributed/blob/806a7e97285c5534b3e37e912cbc060a8036c56f/distributed/core.py#L456
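
The difference described above can be illustrated with a small asyncio sketch. This is not the distributed code: `CommClosed`, `StreamLikeComm`, `MessageLikeComm`, and this `handle_comm` are hypothetical stand-ins, and the toy classes only assume that a stream transport wakes a pending read with an error when the peer disconnects, while a message-based transport delivers nothing.

```python
import asyncio


class CommClosed(Exception):
    """Hypothetical stand-in for the error a closed stream raises."""


class StreamLikeComm:
    """Toy comm whose pending read fails once the peer disconnects."""

    def __init__(self):
        self._queue = asyncio.Queue()

    def peer_closed(self):
        # A stream transport notices the disconnect and wakes the reader with an error.
        self._queue.put_nowait(CommClosed("stream closed by peer"))

    async def read(self):
        item = await self._queue.get()
        if isinstance(item, Exception):
            raise item
        return item


class MessageLikeComm(StreamLikeComm):
    """Toy comm that gets no notification when the peer goes away."""

    def peer_closed(self):
        # Nothing is delivered, so a pending read keeps waiting.
        pass


async def handle_comm(comm):
    """Simplified server loop: read messages until the comm reports closure."""
    try:
        while True:
            await comm.read()
    except CommClosed:
        return "peer removed"


async def demo(comm, timeout=0.2):
    task = asyncio.ensure_future(handle_comm(comm))
    await asyncio.sleep(0)  # let the handler start waiting on read()
    comm.peer_closed()
    try:
        return await asyncio.wait_for(task, timeout)
    except asyncio.TimeoutError:
        return "still waiting"


def run_demo():
    async def main():
        return await demo(StreamLikeComm()), await demo(MessageLikeComm())

    return asyncio.run(main())


if __name__ == "__main__":
    print(run_demo())
```

In the second case the handler only leaves its loop if the peer sends an explicit goodbye before exiting, which is why a clean shutdown message from the worker matters more for UCX than for TCP.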

The issue above should be taken care of by

https://github.com/dask/distributed/blob/806a7e97285c5534b3e37e912cbc060a8036c56f/distributed/nanny.py#L674-L699

But as far as I could understand, the process exits before do_stop gets executed. I think that is where the bug is, but I wasn’t able to work out which of the async tasks causes the process to exit prematurely.
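
As a rough illustration of this kind of failure (not a diagnosis of the actual bug), the sketch below shows how cleanup that is scheduled but never awaited can be dropped when the event loop shuts down. The `do_stop` body here is a hypothetical stand-in that only records a message; `exit_without_waiting`, `exit_after_waiting`, and `run` are illustrative names, not functions from distributed.

```python
import asyncio


async def do_stop(log):
    # Hypothetical stand-in for cleanup that tells the scheduler the worker is leaving.
    await asyncio.sleep(0.01)
    log.append("scheduler informed")


async def exit_without_waiting(log):
    # Cleanup is scheduled as a background task, but main returns immediately,
    # so asyncio.run() cancels the task before it can finish.
    asyncio.ensure_future(do_stop(log))


async def exit_after_waiting(log):
    # Cleanup is awaited, so it completes before the loop is closed.
    await do_stop(log)


def run(main):
    log = []
    asyncio.run(main(log))
    return log


if __name__ == "__main__":
    print(run(exit_without_waiting))
    print(run(exit_after_waiting))
```

If something similar happens in the worker process, the scheduler would never receive the closing message, and with UCX there is no stream error to fall back on.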

To reproduce this, start a scheduler and a worker as below and, once the worker is connected, press CTRL+C on the worker. The scheduler will not report that the worker has closed.

dask-scheduler --protocol ucx
dask-worker ucx://SERVER_IP:8786

@jacobtomlinson @pitrou I’d appreciate it if you could shed some light on this.

cc @quasiben

Issue Analytics

Created 4 years ago
Comments: 9 (9 by maintainers)

Top GitHub Comments

quasiben commented, Apr 30, 2020

Closed by #3747.

jakirkham commented, Apr 29, 2020

This should be solved by PR https://github.com/dask/distributed/pull/3747.
