Broker /health endpoint returns 200 OK when process first starts up

After restarting a broker process, we noticed that its /health endpoint returns 200 immediately after the process starts. Queries sent to the broker during this window come back with BrokerResourceMissingError in the exceptions field. This lasts for several seconds, after which /health returns 503 until the broker has finished building its routing maps.

We’re concerned that our load balancers might route traffic to brokers that are still restarting. Would it be possible for a broker to report 503 from startup until it is ready to serve queries for all the tables assigned to it?
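The behavior requested above can be illustrated with a minimal sketch. This is not Pinot's implementation: the class ReadinessGate and its methods are hypothetical names, and the sketch assumes the broker knows its assigned tables up front and can signal when each routing map is built.

```java
import java.util.Collection;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical readiness gate; illustrative only, not a Pinot class. */
final class ReadinessGate {
  // Tables whose routing has not been built yet.
  private final Set<String> pendingTables = ConcurrentHashMap.newKeySet();

  ReadinessGate(Collection<String> assignedTables) {
    pendingTables.addAll(assignedTables);
  }

  /** Called (by assumption) once routing for a table is ready. */
  void markRoutingBuilt(String table) {
    pendingTables.remove(table);
  }

  /** Status a /health handler would return: 503 until every table is ready. */
  int healthStatusCode() {
    return pendingTables.isEmpty() ? 200 : 503;
  }
}
```

Because the gate starts with every assigned table pending, the endpoint reports 503 from the first moment the process is up, which closes the window in which a load balancer could see a premature 200.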

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 10 (10 by maintainers)

Top GitHub Comments

jadami10 commented, Dec 7, 2021

My second question is: can we make it configurable for the service status to switch to 503 as soon as we send TERM or HUP? Otherwise, even though the broker will go down after draining requests, the load balancer will still think it is safe to send requests its way.
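As a rough sketch of this idea, assuming a plain JVM process: the JVM runs registered shutdown hooks when it receives SIGTERM, so a hook can flip the health status to 503 before the drain period begins. DrainingHealthCheck and its methods are hypothetical names, not part of Pinot or of the PR discussed here.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical health check that fails as soon as shutdown begins. */
final class DrainingHealthCheck {
  private final AtomicBoolean accepting = new AtomicBoolean(true);

  /** Status a /health handler would return. */
  int healthStatusCode() {
    return accepting.get() ? 200 : 503;
  }

  /** Marks the service as unhealthy; in-flight requests may still finish. */
  void beginShutdown() {
    accepting.set(false);
  }

  /** Registers a JVM shutdown hook that fails the health check, then waits. */
  Thread installShutdownHook(long drainMillis) {
    Thread hook = new Thread(() -> {
      beginShutdown(); // fail /health first so load balancers stop routing here
      try {
        Thread.sleep(drainMillis); // assumed drain window for in-flight queries
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }, "health-drain-hook");
    Runtime.getRuntime().addShutdownHook(hook);
    return hook;
  }
}
```

The drain window should be at least as long as the load balancer's health check interval times its failure threshold, so the balancer has time to notice the 503 before the process exits.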

xiangfu0 commented, Dec 8, 2021

For the shutdown phase, both the PinotServiceManager and broker health checks should fail when you send the signal.

This is not the case, as PinotServiceManager tries to stop the broker first, and the broker sleeps for 10 seconds by default and then just deregisters the handler. But #7880 seems to get this right by just failing the PinotServiceManager instead. Are we comfortable rolling this out as the new default for everyone?

This PR enables the health check once all bootstrap services are ready. When the signal is sent, PinotServiceManager disables the health check first.

Thank you for doing this. This was my other thought on how to approach it, and it does cover all the services at once. I just have concerns that this approach is extremely prone to random deadlocks.

public static Status getServiceStatus() {
  return getServiceStatus(SERVICE_STATUS_CALLBACK);
}

really needs to be limited in usage to only the /health endpoint. That said, I guess if you make it deadlock, then the integration tests fail, so maybe this is ok.

True. I’m making changes so that the PinotSM status check is not added when the SM port is < 0.
