
Unexpected behaviour of /health endpoint of Dask scheduler and workers


What happened:

Each time I call the /health HTTP endpoint on either the Dask scheduler or a worker, the following line gets printed to my logs:

distributed.comm.tcp - INFO - Connection closed before handshake completed

What is more, the output I get back is binary (and not JSON):

$ wget --quiet -O - http://dask-scheduler:8786/health
:*��compression��python�	�

The status code seems to be 000:

$ curl --silent --output /dev/null -w "%{http_code}" https://dask-scheduler:8786/health
000

What you expected to happen:

  • I would expect the /health endpoints not to write to the logs when the status is ok. If something has to be printed I expect it to be less cryptic.
  • I would expect the /health endpoint to just check if the service is alive.
  • I would expect the /health endpoint to return a status code of 200 when the service is ok.
  • I would expect the /health endpoint to maybe return some additional information as json.
  • I would expect the /health endpoint to work similarly to (for example) Hasura’s /healthz endpoint.

Minimal Complete Verifiable Example:

Start Dask scheduler (or worker) in a container with service name dask-scheduler:

dask-scheduler \
  --dashboard \
  --port 8786 \
  --host dask-scheduler \
  --dashboard-address ":8787"

Call HTTP /health endpoint via curl:

curl --output - http://dask-scheduler:8786/health 

or wget:

wget --quiet -O - http://dask-scheduler:8786/health

The dask scheduler can be replaced by a worker service with similar results.

Anything else we need to know?:

  • I’m trying to use the /health endpoint for Docker healthchecks of my Dask deployment (see the sketch after this list). The endpoint gets called at a regular interval, so the current message spams my logs, and I want to avoid raising the log level because I still need to capture other messages.
  • I found a Stack Overflow post from someone running into the same issue on K8s (I think because K8s performs these healthchecks automatically).
  • Please let me know if I’m using the /health endpoint wrong. However, the status code of 000 together with the Connection closed before handshake completed message leads me to believe that the connection is somehow being closed prematurely.
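
For illustration, here is roughly the kind of Docker healthcheck I have in mind. This is only a sketch: the image name (daskdev/dask), the network name, the intervals, and the assumption that curl is available inside the container are all placeholders, and the probe targets the dashboard/HTTP port (8787) rather than the comm port, as discussed in the comments below.

# Sketch: run the scheduler with a Docker healthcheck probing /health.
# Image, network, and intervals are placeholders; curl must exist in the image.
docker run -d \
  --name dask-scheduler \
  --network dask-net \
  --health-cmd 'curl --fail --silent http://localhost:8787/health || exit 1' \
  --health-interval 30s \
  --health-timeout 5s \
  --health-retries 3 \
  daskdev/dask \
  dask-scheduler --dashboard --dashboard-address ":8787"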

Environment:

  • Dask version: 2021.04.0+24.g29e17a05 (master branch)
  • Python version: 3.9.2.final.0
  • Operating System: Linux 5.8.0-50-generic (Debian Buster slim)
  • Install method (conda, pip, source): pip in conda environment (git+https://github.com/dask/dask@main & git+https://github.com/dask/distributed@main)

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

2 reactions
peterroelants commented, Apr 22, 2021

I see that passing “--dashboard” does not only start the dashboard, but also provides all the other HTTP services (like the /health endpoint).

Starting the worker with its own dashboard by running:

dask-worker \
  --dashboard \
  --dashboard-address ":8789" \
  --host "dask-worker" \
  "tcp://dask-scheduler:8786"

and then using wget --quiet -O - http://dask-worker:8789/health worked for me 👍

Thank you for the help and clarifications. I will close this issue.

0 reactions
fjetter commented, Apr 22, 2021

The servers start one HTTP server, by default on port 8787. This HTTP server serves the dashboard but also the other routes, like health, metrics, etc.

Note that the dashboard-address is different from the port: port is the server address for internal administration (Dask’s own comm protocol), and dashboard-address is the one for the HTTP server.
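
To make the distinction concrete, here is a sketch based on the setup from the MCVE above (the exact response body of /health shown here is an assumption):

# 8787 is the HTTP server started for the dashboard; /health lives here
curl --silent --output /dev/null -w "%{http_code}\n" http://dask-scheduler:8787/health
# -> 200 (the body itself appears to be a short plain-text "ok")

# 8786 is the scheduler's comm port and does not speak HTTP
curl --silent --output /dev/null -w "%{http_code}\n" http://dask-scheduler:8786/health
# -> 000, plus the "Connection closed before handshake completed" log line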

> I think what confused me was the documentation, which (to me at least) seemed to imply that the dashboard, scheduler, and worker each provided their own endpoints. I guess I was wrong in this interpretation.

You are right, the dask workers bring their own server and also have their own dashboard. You might need to check the ports for the worker server. If the given port is already in use, I believe, a free port is chosen automatically. Either way, the actually used addresses are logged after startup. Can you please check your logs to see if something suspicious can be seen there?
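
As a quick way to do that check, something like the following works (a sketch; the container name and the exact log wording are assumptions about your setup):

# Sketch: find the dashboard/HTTP address the worker actually bound to
docker logs dask-worker 2>&1 | grep -i dashboard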
