Make composite healthcheck for ServiceManager
Per #5850, we should have a /health endpoint for our service state. Composite services like ServiceManager should report each of their individual components, and report an aggregate healthy status if and only if all components are healthy. This helps ensure traffic doesn't reach the process while it is still booting up, or after one of its components has crashed.
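Here is a minimal sketch of that aggregation rule in plain Java. It assumes each component can be wrapped in a boolean probe. `CompositeHealthCheck`, its nested `HealthReport`, and the component names are hypothetical and are not existing Pinot classes. Two further choices are also assumptions: returning 503 (Service Unavailable) when any component is down, and treating "nothing registered yet" as unhealthy.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;

// Hypothetical composite check: healthy if and only if every registered component is healthy.
final class CompositeHealthCheck {

    // Hypothetical result type: HTTP status plus each component's individual state.
    static final class HealthReport {
        final int status;                      // 200 = in service, 503 = not in service
        final Map<String, Boolean> components; // component name -> healthy?

        HealthReport(int status, Map<String, Boolean> components) {
            this.status = status;
            this.components = Collections.unmodifiableMap(components);
        }
    }

    private final Map<String, BooleanSupplier> probes = new LinkedHashMap<>();

    void register(String name, BooleanSupplier probe) {
        probes.put(name, probe);
    }

    HealthReport check() {
        Map<String, Boolean> results = new LinkedHashMap<>();
        // Assumption: with nothing registered yet the process is still booting, so report unhealthy.
        boolean allHealthy = !probes.isEmpty();
        for (Map.Entry<String, BooleanSupplier> entry : probes.entrySet()) {
            // Call every probe (no short-circuit) so the report lists each component's state.
            boolean healthy = entry.getValue().getAsBoolean();
            results.put(entry.getKey(), healthy);
            allHealthy = allHealthy && healthy;
        }
        return new HealthReport(allHealthy ? 200 : 503, results);
    }

    public static void main(String[] args) {
        CompositeHealthCheck check = new CompositeHealthCheck();
        check.register("controller", () -> true);
        check.register("server", () -> false);
        HealthReport report = check.check();
        System.out.println(report.status + " " + report.components); // 503 {controller=true, server=false}
    }
}
```

Every probe runs even after one has failed. As a result, the report still lists each component's state, which tells an operator which part of the process is down.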
This was discussed in https://apache-pinot.slack.com/archives/CDRCA57FC/p1598498630061400
Make a composite health endpoint similar to this:
```java
@GET
@Produces(MediaType.APPLICATION_JSON)
@Path("/instances")
@ApiOperation(value = "Get Pinot Instances Status")
@ApiResponses(value = {
    @ApiResponse(code = 200, message = "Instance Status"),
    @ApiResponse(code = 500, message = "Internal server error")
})
public Map<String, PinotInstanceStatus> getPinotAllInstancesStatus() {
  Map<String, PinotInstanceStatus> results = new HashMap<>();
  // Collect the status of every instance this ServiceManager is running.
  for (String instanceId : _pinotServiceManager.getRunningInstanceIds()) {
    results.put(instanceId, _pinotServiceManager.getInstanceStatus(instanceId));
  }
  return results;
}
```
In doing so, it would be similar to discovering each component and calling the check from https://github.com/apache/incubator-pinot/pull/5846 on each of them.
The main thing is to represent the composite of the components' health, so you can tell whether the process should be in service or not.
For example, I noticed one part of the process fail in Docker, maybe because zip extraction took too long (I'm not sure). The process still passed the health check! That's bad, because the failed part breaks other things.
Top GitHub Comments
No, I haven't seen a partial-failure scenario using this. I will switch to it. Meanwhile, it might be worth backfilling a test that proves PinotServiceManagerHealthCheck returns 503 on partial failure. If such a test already exists, link it and close this out.
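One possible shape for such a test is shown below. It is written as plain Java with no test framework, so it runs standalone. `PartialFailureHealthTest` and its `compositeStatus` method are hypothetical stand-ins. A real test would drive PinotServiceManagerHealthCheck itself, whose API is not shown here.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;

// Hypothetical test sketch; compositeStatus() stands in for the endpoint under test
// and is not PinotServiceManagerHealthCheck's real API.
final class PartialFailureHealthTest {
    static final int OK = 200;
    static final int SERVICE_UNAVAILABLE = 503;

    // Stand-in: 200 iff at least one component is registered and all report healthy.
    static int compositeStatus(Map<String, BooleanSupplier> components) {
        if (components.isEmpty()) {
            return SERVICE_UNAVAILABLE;
        }
        boolean allHealthy = components.values().stream().allMatch(BooleanSupplier::getAsBoolean);
        return allHealthy ? OK : SERVICE_UNAVAILABLE;
    }

    static void expectStatus(String scenario, int expected, int actual) {
        if (expected != actual) {
            throw new AssertionError(scenario + ": expected " + expected + " but got " + actual);
        }
    }

    static void runAll() {
        Map<String, BooleanSupplier> components = new LinkedHashMap<>();
        components.put("controller", () -> true);
        components.put("broker", () -> true);
        expectStatus("all components healthy", OK, compositeStatus(components));

        // Simulate one component failing while the others keep running.
        components.put("server", () -> false);
        expectStatus("partial failure", SERVICE_UNAVAILABLE, compositeStatus(components));
    }

    public static void main(String[] args) {
        runAll();
        System.out.println("partial-failure test passed");
    }
}
```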
Ultimately, it was a JAR error during bootstrap, which no longer occurs (after extracting all of them). Sadly, I don't have a copy of the message. The surprise was that the listener still worked, heh. I think this error will be less likely after the recent commit, which strictly checks both exceptions and the boolean status.
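A sketch of what checking both exceptions and the boolean status strictly might look like follows. It assumes each component's check is a `Callable<Boolean>`. `StrictProbe` is a hypothetical name, and this is not the code from the commit mentioned above.

```java
import java.util.concurrent.Callable;

// Hypothetical helper: a component counts as healthy only if its check
// completes normally AND returns Boolean.TRUE.
final class StrictProbe {
    private StrictProbe() {}

    static boolean isHealthy(Callable<Boolean> check) {
        try {
            // Boolean.TRUE.equals(...) treats both false and null as unhealthy.
            return Boolean.TRUE.equals(check.call());
        } catch (Exception | LinkageError e) {
            // Any exception, or a class-loading failure such as NoClassDefFoundError,
            // marks the component unhealthy instead of being silently ignored.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isHealthy(() -> true)); // true
        System.out.println(isHealthy(() -> null)); // false
    }
}
```

Catching `LinkageError` alongside `Exception` is a deliberate choice here. A broken or missing JAR can surface as `NoClassDefFoundError`, which is an `Error`, so it would escape a plain `catch (Exception e)`.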