Unable to catch etcd cluster failure
See original GitHub issue
I'm using jetcd 0.3.0 to watch an etcd cluster of 3 nodes running in Docker. After setting up the Watch object, I stopped all 3 nodes so that I could exercise the error handling in my microservice. At this point two things happened:
- The Watch object simply did not react to the outage for a long time: no exceptions were raised. How does jetcd check the health of the etcd cluster? Is there a heartbeat-like mechanism under the hood that I can configure so that the cluster's unavailability is detected much faster? (A possible application-level workaround is sketched after the stack trace below.)
- After a while I received a NullPointerException thrown from line 250 of WatchImpl, inside the onError method (the call to stream.onCompleted()):
2019-03-16 18:33:18.168 ERROR 1 --- [ault-executor-1] io.grpc.internal.SerializingExecutor : Exception while executing runnable io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed@3e14b715
java.lang.NullPointerException: null
at io.etcd.jetcd.WatchImpl$WatcherImpl.onError(WatchImpl.java:250) ~[jetcd-core-0.3.0.jar:na]
at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:434) ~[grpc-stub-1.17.1.jar:1.17.1]
at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39) ~[grpc-core-1.17.1.jar:1.17.1]
at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23) ~[grpc-core-1.17.1.jar:1.17.1]
at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40) ~[grpc-core-1.17.1.jar:1.17.1]
at io.etcd.jetcd.ClientConnectionManager$AuthTokenInterceptor$1$1.onClose(ClientConnectionManager.java:302) ~[jetcd-core-0.3.0.jar:na]
at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39) ~[grpc-core-1.17.1.jar:1.17.1]
at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23) ~[grpc-core-1.17.1.jar:1.17.1]
at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40) ~[grpc-core-1.17.1.jar:1.17.1]
at io.grpc.internal.CensusStatsModule$StatsClientInterceptor$1$1.onClose(CensusStatsModule.java:694) ~[grpc-core-1.17.1.jar:1.17.1]
at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39) ~[grpc-core-1.17.1.jar:1.17.1]
at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23) ~[grpc-core-1.17.1.jar:1.17.1]
at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40) ~[grpc-core-1.17.1.jar:1.17.1]
at io.grpc.internal.CensusTracingModule$TracingClientInterceptor$1$1.onClose(CensusTracingModule.java:397) ~[grpc-core-1.17.1.jar:1.17.1]
at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:459) ~[grpc-core-1.17.1.jar:1.17.1]
at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:63) ~[grpc-core-1.17.1.jar:1.17.1]
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:546) ~[grpc-core-1.17.1.jar:1.17.1]
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$600(ClientCallImpl.java:467) ~[grpc-core-1.17.1.jar:1.17.1]
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:584) ~[grpc-core-1.17.1.jar:1.17.1]
at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) ~[grpc-core-1.17.1.jar:1.17.1]
at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123) ~[grpc-core-1.17.1.jar:1.17.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_191]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_191]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_191]
It is rare for a system distributed across different geographical areas to become completely unavailable, but even in this scenario I still want to be able to handle the errors cleanly within my application. Thank you.
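In case it's useful, here is the kind of application-level heartbeat I'm falling back to for now: a small probe that periodically issues a cheap KV read with a short timeout and treats any failure or timeout as "cluster unreachable". This is not something provided by jetcd itself, only a sketch against the public client API; the class name, probe key, intervals, and timeouts are my own arbitrary choices, and exact method names may differ slightly between jetcd versions.

```java
import io.etcd.jetcd.ByteSequence;
import io.etcd.jetcd.Client;
import io.etcd.jetcd.KV;

import java.nio.charset.StandardCharsets;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Periodically probes the cluster with a cheap KV read and reports when it looks unreachable. */
public class EtcdLivenessProbe {

    private final KV kv;
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public EtcdLivenessProbe(Client client) {
        this.kv = client.getKVClient();
    }

    /** Starts probing every probeIntervalSeconds; onDown is invoked when a probe fails or times out. */
    public void start(long probeIntervalSeconds, Runnable onDown) {
        ByteSequence probeKey = ByteSequence.from("liveness-probe", StandardCharsets.UTF_8);
        scheduler.scheduleAtFixedRate(() -> {
            try {
                // Any key works; we only care whether the request completes in time.
                kv.get(probeKey).get(2, TimeUnit.SECONDS);
            } catch (Exception e) {
                // Timeout or gRPC failure: treat the cluster as unreachable.
                onDown.run();
            }
        }, 0, probeIntervalSeconds, TimeUnit.SECONDS);
    }

    public void stop() {
        scheduler.shutdownNow();
    }
}
```

A caller could, for example, flip a health flag inside onDown so that readiness checks fail quickly instead of waiting for the watch to notice the outage.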
Issue Analytics
- Created 5 years ago
- Comments: 22 (4 by maintainers)
Top Results From Across the Web
Failure modes - etcd
When a leader fails, the etcd cluster automatically elects a new leader. The election does not happen instantly once the leader fails. It...

Troubleshooting 'unable to connect to etcd' Error Message
This issue is caused by the Ondat daemonset pods not being able to connect to the etcd cluster. This is generally caused by...

Etcd failing to setup cluster due to failure to find etcd local ...
I'm attempting to set up a cluster on Ubuntu 18.04 host machines. I'm getting the following error when using DNS for server discovery.

Breaking down and fixing etcd cluster | by Andrei Kvapil
This article will be fully devoted to restoring an etcd cluster. ... Failed to get the status of endpoint https://10.20.30.103:2379 (context ...

Testing of Etcd Failure - Zhimin Wen - Medium
An odd number of nodes establishes a high-availability etcd cluster. When different numbers of nodes start to fail, ...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I am having a similar issue. It is not clear to me how connection issues should be handled by a user of jetcd. I believe the onError in WatchImpl checks for a halt error or a no-leader error, neither of which matches here because the cluster is entirely unavailable. It would be useful to propagate this to the caller, because the caller may need to handle the loss of connectivity and act accordingly.
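In the meantime, a caller-side workaround that seems to help is to own the retry loop yourself: re-create the watcher from your own listener whenever onError or onCompleted fires. The sketch below is written against the public Watch API as I understand it (method names may differ slightly between jetcd versions), with an arbitrary fixed 5-second retry delay; the wrapper class name is mine.

```java
import io.etcd.jetcd.ByteSequence;
import io.etcd.jetcd.Client;
import io.etcd.jetcd.Watch;
import io.etcd.jetcd.watch.WatchResponse;

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Caller-side wrapper that re-creates the watcher whenever it errors out or completes. */
public class ResilientWatch {

    private final Watch watchClient;
    private final ScheduledExecutorService retryExecutor = Executors.newSingleThreadScheduledExecutor();

    public ResilientWatch(Client client) {
        this.watchClient = client.getWatchClient();
    }

    public void watch(ByteSequence key, Consumer<WatchResponse> onEvent) {
        watchClient.watch(key, new Watch.Listener() {
            @Override
            public void onNext(WatchResponse response) {
                onEvent.accept(response);
            }

            @Override
            public void onError(Throwable throwable) {
                // The library's stream is done at this point; start a fresh watch after a delay.
                scheduleRetry(key, onEvent);
            }

            @Override
            public void onCompleted() {
                // Treat a completed stream the same way as an error.
                scheduleRetry(key, onEvent);
            }
        });
    }

    private void scheduleRetry(ByteSequence key, Consumer<WatchResponse> onEvent) {
        retryExecutor.schedule(() -> watch(key, onEvent), 5, TimeUnit.SECONDS);
    }
}
```

In a real service you would also want to track the last revision you saw and resume from it via WatchOption when re-creating the watch, so events are not missed during the gap; that is omitted here for brevity.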
I'm not sure why stream.onCompleted() ends up throwing a NullPointerException, but it appears to stop the reconnect attempts. This means the client will never reconnect after this error. Is there a way for a caller to check whether the connection is still being re-established? It may be beneficial for the caller to defer other work until the connection is established.

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 7 days.
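In case it helps anyone who lands here: since there is no public API I know of for inspecting the connection state, one option is to gate dependent work behind a small readiness check that keeps retrying a cheap KV read until it succeeds. Again, this is only a sketch with arbitrary names and timeouts, not an official jetcd facility.

```java
import io.etcd.jetcd.ByteSequence;
import io.etcd.jetcd.Client;

import java.nio.charset.StandardCharsets;
import java.util.concurrent.TimeUnit;

/** Blocks until a cheap KV read succeeds, i.e. the cluster is reachable again. */
public final class EtcdReadiness {

    private EtcdReadiness() {
    }

    public static void awaitReachable(Client client, long maxWaitSeconds) throws InterruptedException {
        ByteSequence probeKey = ByteSequence.from("readiness-probe", StandardCharsets.UTF_8);
        long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(maxWaitSeconds);

        while (System.nanoTime() < deadline) {
            try {
                // Succeeds only when the cluster is able to serve the read.
                client.getKVClient().get(probeKey).get(2, TimeUnit.SECONDS);
                return;
            } catch (InterruptedException e) {
                throw e;
            } catch (Exception e) {
                TimeUnit.SECONDS.sleep(1); // back off briefly and retry
            }
        }
        throw new IllegalStateException("etcd cluster not reachable within " + maxWaitSeconds + "s");
    }
}
```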