
Conceptual problem in connection handling

See original GitHub issue

To me it looks like we have a conceptual problem in the handling of connections to other services.

Every now and then I have noticed cases where the adapters just don’t become ready, so I dug into the issue. The log then contains something like:

2019-09-02T17:11:38.515Z INFO  [HonoConnectionImpl] stopping connection attempt to server [host: iot-device-registry.enmasse-infra.svc, port: 5671] due to terminal error
javax.security.sasl.AuthenticationException: Failed to authenticate

This “terminal error” stops the client from re-connecting, and it seems it will never retry again for as long as the process is alive. Effectively, the process is dead from that point on.

My assumption was that the pod would get cleaned up at some point due to failing health checks. However, that doesn’t seem to be the case either. Taking a look at org.eclipse.hono.service.AbstractProtocolAdapterBase, you can see that the connections are checked in the registerReadinessChecks method, but not in registerLivenessChecks.

In that case, the “liveness check” would still always succeed.

Taking a look at the Kubernetes documentation (https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/):

The kubelet uses liveness probes to know when to restart a Container. For example, liveness probes could catch a deadlock, where an application is running, but unable to make progress. Restarting a Container in such a state can help to make the application more available despite bugs.

The kubelet uses readiness probes to know when a Container is ready to start accepting traffic. A Pod is considered ready when all of its Containers are ready. One use of this signal is to control which Pods are used as backends for Services. When a Pod is not ready, it is removed from Service load balancers.

It looks to me as if the correct way to signal the “terminal error” is not through the “readiness” check, but through the “liveness” check instead.
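For illustration, a minimal sketch of what such a check could look like, using the Vert.x health checks API the adapters are built on. The check name and the connection field (standing in for one of the adapter’s service clients) are assumptions for the example, not Hono’s actual code:

import io.vertx.ext.healthchecks.HealthCheckHandler;
import io.vertx.ext.healthchecks.Status;

// Register the downstream connection with the *liveness* handler as well,
// so that a connection which has given up after a "terminal error" makes
// the liveness probe fail and Kubernetes restarts the pod.
@Override
public void registerLivenessChecks(final HealthCheckHandler handler) {
    handler.register("hono-connection-alive", status -> {
        // isConnected() yields a failed future if the client is not
        // (or no longer) connected to the peer.
        connection.isConnected().setHandler(attempt ->
            status.complete(attempt.succeeded() ? Status.OK() : Status.KO()));
    });
}

With that in place, a connection stuck in the “terminal error” state would fail the liveness probe instead of merely the readiness probe, matching the Kubernetes semantics quoted above.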

Additionally, I would question the idea of a terminal error altogether. From what I see, in our case it has to do with an invalid service credential, so the problem will go away again within a minute.

I think we should:

  1. Trigger a liveness failure in case of a terminal error.
  2. Even for a terminal error, allow the client to try again.

Fixing the first point is required to solve the issue. The second point would additionally allow the pod to recover by itself before it gets killed by Kubernetes, and is more of an improvement to the situation; a possible retry loop is sketched below.
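For the second point, a rough sketch of what retrying after a “terminal error” could look like. Here connect() is a hypothetical helper returning a future for a single connection attempt, vertx is the adapter’s Vert.x instance, and the back-off bounds are made up for the example:

// Keep retrying with a capped exponential back-off, even after an
// authentication failure: the invalid credential may simply not have
// been provisioned yet and can become valid again shortly.
private void reconnectWithBackoff(final int attempt) {
    // 100 ms, 200 ms, 400 ms, ... capped at 30 s between attempts
    final long delayMillis = Math.min(30_000L, 100L << Math.min(attempt, 8));
    vertx.setTimer(delayMillis, timerId ->
        connect().setHandler(conn -> {
            if (conn.failed()) {
                reconnectWithBackoff(attempt + 1);
            }
        }));
}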

I also think we should fix this before 1.0.0!

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Reactions: 1
  • Comments: 8 (8 by maintainers)

Top GitHub Comments

1 reaction
sophokles73 commented, Sep 16, 2019

@ctron can this be closed?

1 reaction
ctron commented, Sep 3, 2019

@sophokles73 So, I will move on with #1473 and simply drop the “terminal” stuff
