Unable to catch etcd cluster failure
See original GitHub issue
I'm using jetcd 0.3.0 to watch an etcd cluster of 3 nodes running in Docker. After setting up the Watch object, I stopped all 3 nodes so that I could exercise the error handling in my microservice. At this point two things happened:
- The Watch object simply did not react to the outage for a long time: no exceptions were raised. How does jetcd check the health of the etcd cluster? Is there a heartbeat-like mechanism under the hood that I can configure so that the cluster's unavailability is detected much faster? (A possible application-level workaround is sketched after the stack trace below.)
- After a while I received a NullPointerException thrown from line 250 of WatchImpl, inside the onError method (the call to stream.onCompleted()):
2019-03-16 18:33:18.168 ERROR 1 --- [ault-executor-1] io.grpc.internal.SerializingExecutor : Exception while executing runnable io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed@3e14b715
java.lang.NullPointerException: null
at io.etcd.jetcd.WatchImpl$WatcherImpl.onError(WatchImpl.java:250) ~[jetcd-core-0.3.0.jar:na]
at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:434) ~[grpc-stub-1.17.1.jar:1.17.1]
at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39) ~[grpc-core-1.17.1.jar:1.17.1]
at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23) ~[grpc-core-1.17.1.jar:1.17.1]
at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40) ~[grpc-core-1.17.1.jar:1.17.1]
at io.etcd.jetcd.ClientConnectionManager$AuthTokenInterceptor$1$1.onClose(ClientConnectionManager.java:302) ~[jetcd-core-0.3.0.jar:na]
at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39) ~[grpc-core-1.17.1.jar:1.17.1]
at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23) ~[grpc-core-1.17.1.jar:1.17.1]
at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40) ~[grpc-core-1.17.1.jar:1.17.1]
at io.grpc.internal.CensusStatsModule$StatsClientInterceptor$1$1.onClose(CensusStatsModule.java:694) ~[grpc-core-1.17.1.jar:1.17.1]
at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39) ~[grpc-core-1.17.1.jar:1.17.1]
at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23) ~[grpc-core-1.17.1.jar:1.17.1]
at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40) ~[grpc-core-1.17.1.jar:1.17.1]
at io.grpc.internal.CensusTracingModule$TracingClientInterceptor$1$1.onClose(CensusTracingModule.java:397) ~[grpc-core-1.17.1.jar:1.17.1]
at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:459) ~[grpc-core-1.17.1.jar:1.17.1]
at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:63) ~[grpc-core-1.17.1.jar:1.17.1]
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:546) ~[grpc-core-1.17.1.jar:1.17.1]
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$600(ClientCallImpl.java:467) ~[grpc-core-1.17.1.jar:1.17.1]
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:584) ~[grpc-core-1.17.1.jar:1.17.1]
at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) ~[grpc-core-1.17.1.jar:1.17.1]
at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123) ~[grpc-core-1.17.1.jar:1.17.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_191]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_191]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_191]
It is rare for a system distributed across different geographical areas to become completely unavailable, but even in this scenario I still want to be able to handle the errors cleanly within my application. Thank you.
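In case it's useful, here is the kind of application-level heartbeat I'm falling back to for now: a small probe that periodically issues a cheap KV read with a short timeout and treats any failure or timeout as "cluster unreachable". This is not something provided by jetcd itself, only a sketch against the public client API; the class name, probe key, intervals, and timeouts are my own arbitrary choices, and exact method names may differ slightly between jetcd versions.

```java
import io.etcd.jetcd.ByteSequence;
import io.etcd.jetcd.Client;
import io.etcd.jetcd.KV;

import java.nio.charset.StandardCharsets;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Periodically probes the cluster with a cheap KV read and reports when it looks unreachable. */
public class EtcdLivenessProbe {

    private final KV kv;
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public EtcdLivenessProbe(Client client) {
        this.kv = client.getKVClient();
    }

    /** Starts probing every probeIntervalSeconds; onDown is invoked when a probe fails or times out. */
    public void start(long probeIntervalSeconds, Runnable onDown) {
        ByteSequence probeKey = ByteSequence.from("liveness-probe", StandardCharsets.UTF_8);
        scheduler.scheduleAtFixedRate(() -> {
            try {
                // Any key works; we only care whether the request completes in time.
                kv.get(probeKey).get(2, TimeUnit.SECONDS);
            } catch (Exception e) {
                // Timeout or gRPC failure: treat the cluster as unreachable.
                onDown.run();
            }
        }, 0, probeIntervalSeconds, TimeUnit.SECONDS);
    }

    public void stop() {
        scheduler.shutdownNow();
    }
}
```

A caller could, for example, flip a health flag inside onDown so that readiness checks fail quickly instead of waiting for the watch to notice the outage.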
Issue Analytics
- Created 5 years ago
- Comments: 22 (4 by maintainers)
Top Results From Across the Web
Failure modes - etcd
When a leader fails, the etcd cluster automatically elects a new leader. The election does not happen instantly once the leader fails. It...

Troubleshooting 'unable to connect to etcd' Error Message
This issue is caused by the Ondat daemonset pods not being able to connect to the etcd cluster. This is generally caused by...

Etcd failing to setup cluster due to failure to find etcd local ...
I'm attempting to set up a cluster on Ubuntu 18.04 host machines. I'm getting the following error when using DNS for server discovery.

Breaking down and fixing etcd cluster | by Andrei Kvapil
This article will be fully devoted to restoring an etcd cluster. ... Failed to get the status of endpoint https://10.20.30.103:2379 (context ...

Testing of Etcd Failure - Zhimin Wen - Medium
An odd number of nodes establishes a high-availability etcd cluster. When different numbers of nodes start to fail, ...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I am having a similar issue. It is not clear to me how connection issues should be handled by a user of jetcd. I believe the onError in WatchImpl checks for a halt error or a no-leader error, neither of which matches here because the cluster is entirely unavailable. It would be useful to propagate this to the caller, because the caller may need to handle the loss of connectivity and act accordingly.
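In the meantime, a caller-side workaround that seems to help is to own the retry loop yourself: re-create the watcher from your own listener whenever onError or onCompleted fires. The sketch below is written against the public Watch API as I understand it (method names may differ slightly between jetcd versions), with an arbitrary fixed 5-second retry delay; the wrapper class name is mine.

```java
import io.etcd.jetcd.ByteSequence;
import io.etcd.jetcd.Client;
import io.etcd.jetcd.Watch;
import io.etcd.jetcd.watch.WatchResponse;

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Caller-side wrapper that re-creates the watcher whenever it errors out or completes. */
public class ResilientWatch {

    private final Watch watchClient;
    private final ScheduledExecutorService retryExecutor = Executors.newSingleThreadScheduledExecutor();

    public ResilientWatch(Client client) {
        this.watchClient = client.getWatchClient();
    }

    public void watch(ByteSequence key, Consumer<WatchResponse> onEvent) {
        watchClient.watch(key, new Watch.Listener() {
            @Override
            public void onNext(WatchResponse response) {
                onEvent.accept(response);
            }

            @Override
            public void onError(Throwable throwable) {
                // The library's stream is done at this point; start a fresh watch after a delay.
                scheduleRetry(key, onEvent);
            }

            @Override
            public void onCompleted() {
                // Treat a completed stream the same way as an error.
                scheduleRetry(key, onEvent);
            }
        });
    }

    private void scheduleRetry(ByteSequence key, Consumer<WatchResponse> onEvent) {
        retryExecutor.schedule(() -> watch(key, onEvent), 5, TimeUnit.SECONDS);
    }
}
```

In a real service you would also want to track the last revision you saw and resume from it via WatchOption when re-creating the watch, so events are not missed during the gap; that is omitted here for brevity.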
I'm not sure why stream.onCompleted() ends up throwing a NullPointerException, but it appears to stop the reconnect attempts. This means the client will never reconnect after this error. Is there a way for a caller to check whether the connection is still being re-established? It may be beneficial for the caller to defer other work until the connection is established.

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 7 days.
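In case it helps anyone who lands here: since there is no public API I know of for inspecting the connection state, one option is to gate dependent work behind a small readiness check that keeps retrying a cheap KV read until it succeeds. Again, this is only a sketch with arbitrary names and timeouts, not an official jetcd facility.

```java
import io.etcd.jetcd.ByteSequence;
import io.etcd.jetcd.Client;

import java.nio.charset.StandardCharsets;
import java.util.concurrent.TimeUnit;

/** Blocks until a cheap KV read succeeds, i.e. the cluster is reachable again. */
public final class EtcdReadiness {

    private EtcdReadiness() {
    }

    public static void awaitReachable(Client client, long maxWaitSeconds) throws InterruptedException {
        ByteSequence probeKey = ByteSequence.from("readiness-probe", StandardCharsets.UTF_8);
        long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(maxWaitSeconds);

        while (System.nanoTime() < deadline) {
            try {
                // Succeeds only when the cluster is able to serve the read.
                client.getKVClient().get(probeKey).get(2, TimeUnit.SECONDS);
                return;
            } catch (InterruptedException e) {
                throw e;
            } catch (Exception e) {
                TimeUnit.SECONDS.sleep(1); // back off briefly and retry
            }
        }
        throw new IllegalStateException("etcd cluster not reachable within " + maxWaitSeconds + "s");
    }
}
```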