dask-worker fails to inform scheduler of closing with UCX
dask-workers fail to inform the scheduler of their closing when the protocol used is UCX. I spent quite some time trying to debug this, but I got really confused by the async state in distributed. What I figured out is that the TCP stream in Tornado will hit
which will raise an exception and break out of
and cause the scheduler to close its end in
However, UCX isn’t a stream and will not raise such an exception, causing the scheduler to remain awaiting
The issue above should be taken care of by
But as far as I could understand, the process exits before do_stop gets executed, which is where I think the bug is, but I wasn’t able to work out which of the async tasks causes the process to exit prematurely.
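To illustrate the stream behaviour described above, here is a minimal sketch using only the standard-library asyncio TCP streams. It is an analogy, not Tornado, UCX, or distributed code: the `handle` coroutine stands in for the scheduler side and the client connection stands in for the worker, and both names are made up for this example. The point is that on a stream transport the reading side is woken up when the peer's socket closes, even though the peer never sent an explicit "I am closing" message. A transport without that implicit signal has to rely on the explicit message being sent before the process exits.

```python
import asyncio


async def demo():
    # Set by the "scheduler" side once it notices the peer has gone away.
    peer_closed = asyncio.Event()

    async def handle(reader, writer):
        # Stand-in for the scheduler: block waiting for data from the worker.
        data = await reader.read(100)
        if data == b"":
            # An empty read means EOF: the peer closed its end of the stream.
            peer_closed.set()
        writer.close()

    # Port 0 lets the OS pick a free port for this local demonstration.
    server = await asyncio.start_server(handle, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]

    # Stand-in for the worker: connect, then close without sending anything.
    _, writer = await asyncio.open_connection("127.0.0.1", port)
    writer.close()
    await writer.wait_closed()

    # The server side is notified by the transport itself.
    await asyncio.wait_for(peer_closed.wait(), timeout=5)

    server.close()
    await server.wait_closed()
    return peer_closed.is_set()


result = asyncio.run(demo())
print("server noticed the close:", result)
```

In this sketch the close is detected purely from the socket state, which is the behaviour the TCP path gets for free.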
To reproduce this, start a scheduler and a worker as below and, once the worker is connected, press CTRL+C. The scheduler will not report that the worker has closed.
dask-scheduler --protocol ucx
dask-worker ucx://SERVER_IP:8786
@jacobtomlinson @pitrou I’d appreciate it if you could shed some light on this.
cc @quasiben
Top GitHub Comments
closed by #3747
This should be solved by PR https://github.com/dask/distributed/pull/3747.