Unexpected behaviour of /health endpoint of Dask scheduler and workers
What happened:
Each time I call the /health HTTP endpoint on either the Dask scheduler or a worker, the following line is printed to my logs:
distributed.comm.tcp - INFO - Connection closed before handshake completed
What is more, when looking at the output I see binary output (and not JSON):
$ wget --quiet -O - http://dask-scheduler:8786/health
:*��compression��python� �
The status code seems to be 000:
$ curl --silent --output /dev/null -w "%{http_code}" https://dask-scheduler:8786/health
000
What you expected to happen:
- I would expect the /health endpoints not to write to the logs when the status is ok. If something has to be printed, I expect it to be less cryptic.
- I would expect the /health endpoint to just check if the service is alive.
- I would expect the /health endpoint to return a status code of 200 when the service is ok.
- I would expect the /health endpoint to maybe return some additional information as JSON.
- I would expect the /health endpoint to work similarly to (for example) Hasura’s /healthz endpoint.
Minimal Complete Verifiable Example:
Start the Dask scheduler (or a worker) in a container with service name dask-scheduler:
dask-scheduler \
--dashboard \
--port 8786 \
--host dask-scheduler \
--dashboard-address ":8787"
Call the HTTP /health endpoint via curl:
curl --output - http://dask-scheduler:8786/health
or wget:
wget --quiet -O - http://dask-scheduler:8786/health
The dask scheduler can be replaced by a worker service with similar results.
Anything else we need to know?:
- I’m trying to use the /health endpoint for Docker healthchecks of my Dask deployment. I want to avoid setting the logger to a higher level so that I still capture other messages; however, the /health endpoint gets called at a regular interval and thus the current message spams my logs.
- I found a Stack Overflow post from someone running into the same issue on K8s (I think because K8s performs these healthchecks automatically).
- Please let me know if I’m using the /health endpoint wrong. However, the status code of 000 together with the "Connection closed before handshake completed" message leads me to believe that somehow the connection is being disconnected prematurely.
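A quick way to test this suspicion is to check whether the port answers with HTTP at all. The following is a minimal sketch and not part of Dask: the helper name speaks_http is hypothetical, and the host and port are simply the ones used in this report. It sends a plain GET request over a raw socket and reports whether the reply starts with an HTTP status line.

```python
import socket


def speaks_http(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if the reply to a plain GET starts like an HTTP status line.

    Hypothetical diagnostic helper; it is not a Dask API.
    """
    request = f"GET /health HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii")
    data = b""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(request)
            # Read until there are enough bytes to compare against b"HTTP/".
            while len(data) < 5:
                chunk = sock.recv(64)
                if not chunk:  # peer closed the connection
                    break
                data += chunk
    except OSError:
        # Connection refused, reset, or timed out.
        return False
    return data.startswith(b"HTTP/")
```

For example, if speaks_http("dask-scheduler", 8786) returns False while the port is reachable, that would suggest the port is not serving HTTP, which would be consistent with the binary output and the 000 status code shown above.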
Environment:
- Dask version: 2021.04.0+24.g29e17a05 (master branch)
- Python version: 3.9.2.final.0
- Operating System: Linux 5.8.0-50-generic (Debian Buster slim)
- Install method (conda, pip, source): pip in conda environment (git+https://github.com/dask/dask@main & git+https://github.com/dask/distributed@main)
Issue Analytics
- State:
- Created 2 years ago
- Comments:6 (2 by maintainers)
Top GitHub Comments
I see that starting with --dashboard does not only start the dashboard, but also provides all other HTTP services (like the /health endpoint).

Starting the worker with its own dashboard by running:
and then using
wget --quiet -O - http://dask-worker:8789/health
worked for me 👍

Thank you for the help and clarifications. I will close this issue.
The servers start one HTTP server, by default on port 8787. This HTTP server serves the dashboard but also the other routes, like health, metrics, etc.

Note that dashboard-address is different from port: port is the server address for internal administration, and dashboard-address is the one for the HTTP server.

You are right, the Dask workers bring their own server and also have their own dashboard. You might need to check the ports for the worker server. If the given port is already in use, I believe a free port is chosen automatically. Either way, the addresses actually used are logged after startup. Can you please check your logs to see if anything suspicious shows up there?
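For a Docker healthcheck, this means the probe should target the HTTP server's address rather than the address given by --port. Below is a minimal sketch using only the Python standard library. It assumes the default HTTP port 8787 mentioned in this comment and the service name dask-scheduler from the report; the function name is_healthy is illustrative and not a Dask API.

```python
import http.client
import urllib.error
import urllib.request


def is_healthy(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET request on `url` answers with HTTP 200.

    Illustrative helper for a healthcheck probe; it is not part of Dask.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # URLError: connection refused or an HTTP error status (HTTPError).
        # HTTPException: the peer replied with something that is not HTTP.
        # OSError: timeouts and other socket-level failures.
        return False
```

A healthcheck command could then exit with status 0 when is_healthy("http://dask-scheduler:8787/health") returns True and with 1 otherwise. For a worker, the URL would need the worker's own HTTP port, which is reported in its startup logs.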