Conceptual problem in connection handling
To me it looks like we have a conceptual problem in the handling of connections to other services.
Every now and then I have noticed cases where the adapters just don’t become ready, so I dug into the issue. In the log you can then find something like:
2019-09-02T17:11:38.515Z INFO [HonoConnectionImpl] stopping connection attempt to server [host: iot-device-registry.enmasse-infra.svc, port: 5671] due to terminal error
javax.security.sasl.AuthenticationException: Failed to authenticate
This “terminal error” stops the client from trying to re-connect, and it seems it will never retry again for as long as the process is alive. So effectively, the process is dead from that point on.
My assumption was that the pod would get cleaned up at some point due to failing health checks. However, that doesn’t seem to be the case either. Taking a look at org.eclipse.hono.service.AbstractProtocolAdapterBase, you can see that the connections are checked in registerReadinessChecks, but not in the registerLivenessChecks method. So in that case, the “liveness check” would still always succeed.
Taking a look at the Kubernetes documentation (https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/):
The kubelet uses liveness probes to know when to restart a Container. For example, liveness probes could catch a deadlock, where an application is running, but unable to make progress. Restarting a Container in such a state can help to make the application more available despite bugs.

The kubelet uses readiness probes to know when a Container is ready to start accepting traffic. A Pod is considered ready when all of its Containers are ready. One use of this signal is to control which Pods are used as backends for Services. When a Pod is not ready, it is removed from Service load balancers.
It looks to me as if the correct way to signal the “terminal error” is not through the “readiness” check, but through the “liveness” check instead.
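A minimal sketch of what that could look like, assuming Hono’s HealthCheckProvider contract and the Vert.x health checks API (Vert.x 4 style with Promise here); the terminallyFailed flag and the onTerminalError() hook are hypothetical and only serve to illustrate the idea:

```java
import io.vertx.core.Promise;
import io.vertx.ext.healthchecks.HealthCheckHandler;
import io.vertx.ext.healthchecks.Status;

public class ConnectionLivenessCheck {

    // set once the connection client has given up re-connecting
    private volatile boolean terminallyFailed;

    // hypothetical hook, invoked by the connection client on a "terminal" error
    public void onTerminalError() {
        terminallyFailed = true;
    }

    public void registerLivenessChecks(final HealthCheckHandler livenessHandler) {
        // report KO once a terminal error has occurred, so that Kubernetes
        // restarts the pod instead of leaving it running but unable to connect
        livenessHandler.register("connection-terminal-error",
                (Promise<Status> result) -> result.complete(terminallyFailed ? Status.KO() : Status.OK()));
    }
}
```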
Additionally, I would question the idea of a terminal error altogether. From what I see, in our case it is caused by an invalid service credential, and that is a problem which may well go away again a minute later.
I think we should:
- Trigger a liveness failure in case of a terminal error.
- Even for a terminal error, allow the client to try again.
Fixing the first point is required to solve the issue. The second point would only allow the pod to recover by itself before it gets killed by Kubernetes, so it is more of an improvement of the situation.
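As an illustration of the second point, here is a rough sketch (not Hono’s actual HonoConnectionImpl) of a reconnect loop that keeps retrying with exponential backoff instead of classifying any failure as terminal; connectOnce() is a hypothetical placeholder for a single connection attempt:

```java
import java.util.concurrent.TimeUnit;

public final class ReconnectLoop {

    private static final long MAX_BACKOFF_MILLIS = 30_000;

    // keeps retrying forever, so that e.g. an AuthenticationException caused by
    // temporarily invalid service credentials does not leave the process "dead"
    public void connectWithRetry() throws InterruptedException {
        long backoff = 500; // start with half a second
        while (true) {
            try {
                connectOnce(); // hypothetical: performs a single connection attempt
                return;        // connected successfully
            } catch (final Exception e) {
                System.err.println("connection attempt failed, retrying in "
                        + backoff + "ms: " + e);
                TimeUnit.MILLISECONDS.sleep(backoff);
                backoff = Math.min(backoff * 2, MAX_BACKOFF_MILLIS);
            }
        }
    }

    private void connectOnce() throws Exception {
        // placeholder for the actual AMQP connection / SASL handshake
        throw new UnsupportedOperationException("not implemented in this sketch");
    }
}
```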
I also think we should fix this before 1.0.0!
Top GitHub Comments
@ctron can this be closed?
@sophokles73 So, I will move on with #1473 and simply drop the “terminal” stuff