
Unexpected behaviour of /health endpoint of Dask scheduler and workers


What happened:

Each time I call the /health HTTP endpoint on either the Dask scheduler or a worker, the following line gets printed to my logs:

distributed.comm.tcp - INFO - Connection closed before handshake completed

What is more, the output I get back is binary (and not JSON):

$ wget --quiet -O - http://dask-scheduler:8786/health
:*��compression��python�	�

The status code seems to be 000:

$ curl --silent --output /dev/null -w "%{http_code}" https://dask-scheduler:8786/health
000

What you expected to happen:

  • I would expect the /health endpoints not to write to the logs when the status is ok. If something has to be printed I expect it to be less cryptic.
  • I would expect the /health endpoint to just check if the service is alive.
  • I would expect the /health endpoint to return a status code of 200 when the service is ok.
  • I would expect the /health endpoint to maybe return some additional information as json.
  • I would expect the /health endpoint to work similarly to (for example) Hasura’s /healthz endpoint.

Minimal Complete Verifiable Example:

Start Dask scheduler (or worker) in a container with service name dask-scheduler:

dask-scheduler \
  --dashboard \
  --port 8786 \
  --host dask-scheduler \
  --dashboard-address ":8787"

Call HTTP /health endpoint via curl:

curl --output - http://dask-scheduler:8786/health 

or wget:

wget --quiet -O - http://dask-scheduler:8786/health

The dask scheduler can be replaced by a worker service with similar results.

Anything else we need to know?:

  • I’m trying to use the /health endpoint for Docker healthchecks of my Dask deployment (see the sketch after this list). The endpoint gets called at a regular interval, so the current message spams my logs, and I want to avoid raising the log level because I still need to capture other messages.
  • I found a Stack Overflow post from someone running into the same issue on K8s (I think because K8s performs these healthchecks automatically).
  • Please let me know if I’m using the /health endpoint wrong. However, the status code of 000 together with the Connection closed before handshake completed message leads me to believe that the connection is somehow being closed prematurely.
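
For illustration, here is roughly the kind of Docker healthcheck I have in mind. This is only a sketch: the image name (daskdev/dask), the network name, the intervals, and the assumption that curl is available inside the container are all placeholders, and the probe targets the dashboard/HTTP port (8787) rather than the comm port, as discussed in the comments below.

# Sketch: run the scheduler with a Docker healthcheck probing /health.
# Image, network, and intervals are placeholders; curl must exist in the image.
docker run -d \
  --name dask-scheduler \
  --network dask-net \
  --health-cmd 'curl --fail --silent http://localhost:8787/health || exit 1' \
  --health-interval 30s \
  --health-timeout 5s \
  --health-retries 3 \
  daskdev/dask \
  dask-scheduler --dashboard --dashboard-address ":8787"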

Environment:

  • Dask version: 2021.04.0+24.g29e17a05 (master branch)
  • Python version: 3.9.2.final.0
  • Operating System: Linux 5.8.0-50-generic (Debian Buster slim)
  • Install method (conda, pip, source): pip in conda environment (git+https://github.com/dask/dask@main & git+https://github.com/dask/distributed@main)

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

2 reactions
peterroelants commented, Apr 22, 2021

I see that passing “--dashboard” does not only start the dashboard, but also provides all the other HTTP services (like the /health endpoint).

Starting the worker with its own dashboard by running:

dask-worker \
  --dashboard \
  --dashboard-address ":8789" \
  --host "dask-worker" \
  "tcp://dask-scheduler:8786"

and then using wget --quiet -O - http://dask-worker:8789/health worked for me 👍

Thank you for the help and clarifications. I will close this issue.

0 reactions
fjetter commented, Apr 22, 2021

The servers start one HTTP server, by default on port 8787. This HTTP server serves the dashboard but also the other routes, like health, metrics, etc.

Note that the dashboard-address is different from the port: port is the server address for internal administration (Dask’s own comm protocol), and dashboard-address is the one for the HTTP server.
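
To make the distinction concrete, here is a sketch based on the setup from the MCVE above (the exact response body of /health shown here is an assumption):

# 8787 is the HTTP server started for the dashboard; /health lives here
curl --silent --output /dev/null -w "%{http_code}\n" http://dask-scheduler:8787/health
# -> 200 (the body itself appears to be a short plain-text "ok")

# 8786 is the scheduler's comm port and does not speak HTTP
curl --silent --output /dev/null -w "%{http_code}\n" http://dask-scheduler:8786/health
# -> 000, plus the "Connection closed before handshake completed" log line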

> I think what confused me was the documentation, which (to me at least) seemed to imply that the dashboard, scheduler, and worker each provided their own endpoints. I guess I was wrong in this interpretation.

You are right, the dask workers bring their own server and also have their own dashboard. You might need to check the ports for the worker server. If the given port is already in use, I believe, a free port is chosen automatically. Either way, the actually used addresses are logged after startup. Can you please check your logs to see if something suspicious can be seen there?
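
As a quick way to do that check, something like the following works (a sketch; the container name and the exact log wording are assumptions about your setup):

# Sketch: find the dashboard/HTTP address the worker actually bound to
docker logs dask-worker 2>&1 | grep -i dashboard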
