
Unable to catch etcd cluster failure

See original GitHub issue

I’m using the 0.3.0 version of jetcd to watch an etcd cluster composed of 3 nodes running in Docker. After setting up the Watch object, I stopped all 3 nodes to verify that my microservice could handle the resulting errors. At this point two things happened:

  1. The Watch object simply did not react to this change for a long time: no exceptions were raised. I wonder how jetcd checks the health of the etcd cluster: is there a mechanism under the hood, similar to a heartbeat, that I can configure to detect the cluster’s unavailability much faster? (See the probe sketch after the stack trace below.)
  2. After a while I received a NullPointerException thrown at line 250 of WatchImpl, inside the onError method (the stream.onCompleted() call):
2019-03-16 18:33:18.168 ERROR 1 --- [ault-executor-1] io.grpc.internal.SerializingExecutor     : Exception while executing runnable io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed@3e14b715
java.lang.NullPointerException: null
      at io.etcd.jetcd.WatchImpl$WatcherImpl.onError(WatchImpl.java:250) ~[jetcd-core-0.3.0.jar:na]
      at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:434) ~[grpc-stub-1.17.1.jar:1.17.1]
      at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39) ~[grpc-core-1.17.1.jar:1.17.1]
      at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23) ~[grpc-core-1.17.1.jar:1.17.1]
      at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40) ~[grpc-core-1.17.1.jar:1.17.1]
      at io.etcd.jetcd.ClientConnectionManager$AuthTokenInterceptor$1$1.onClose(ClientConnectionManager.java:302) ~[jetcd-core-0.3.0.jar:na]
      at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39) ~[grpc-core-1.17.1.jar:1.17.1]
      at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23) ~[grpc-core-1.17.1.jar:1.17.1]
      at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40) ~[grpc-core-1.17.1.jar:1.17.1]
      at io.grpc.internal.CensusStatsModule$StatsClientInterceptor$1$1.onClose(CensusStatsModule.java:694) ~[grpc-core-1.17.1.jar:1.17.1]
      at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39) ~[grpc-core-1.17.1.jar:1.17.1]
      at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23) ~[grpc-core-1.17.1.jar:1.17.1]
      at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40) ~[grpc-core-1.17.1.jar:1.17.1]
      at io.grpc.internal.CensusTracingModule$TracingClientInterceptor$1$1.onClose(CensusTracingModule.java:397) ~[grpc-core-1.17.1.jar:1.17.1]
      at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:459) ~[grpc-core-1.17.1.jar:1.17.1]
      at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:63) ~[grpc-core-1.17.1.jar:1.17.1]
      at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:546) ~[grpc-core-1.17.1.jar:1.17.1]
      at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$600(ClientCallImpl.java:467) ~[grpc-core-1.17.1.jar:1.17.1]
      at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:584) ~[grpc-core-1.17.1.jar:1.17.1]
      at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) ~[grpc-core-1.17.1.jar:1.17.1]
      at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123) ~[grpc-core-1.17.1.jar:1.17.1]
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_191]
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_191]
      at java.lang.Thread.run(Thread.java:748) [na:1.8.0_191]

It is rare for a system distributed across different geographical areas to be completely unavailable, but even in this scenario I still want to be able to handle the errors cleanly within my application. Thank you.
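One way to detect the outage faster than the Watch stream does is an application-level liveness probe run alongside the watch. The sketch below is not part of the original report; the endpoints and the probe key are placeholders, and it only assumes jetcd’s Client/KV API. A short timeout on the returned future turns “cluster unreachable” into a signal within seconds. Later jetcd releases also appear to expose gRPC keepalive settings (keepaliveTime/keepaliveTimeout) on the client builder, but it is unclear whether those options exist in 0.3.0, so they are not shown here.

    import java.nio.charset.StandardCharsets;
    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    import io.etcd.jetcd.ByteSequence;
    import io.etcd.jetcd.Client;

    public class EtcdLivenessProbe {

        public static void main(String[] args) throws InterruptedException {
            // Placeholder endpoints for the 3-node Docker cluster described above.
            Client client = Client.builder()
                    .endpoints("http://etcd1:2379", "http://etcd2:2379", "http://etcd3:2379")
                    .build();

            ByteSequence probeKey = ByteSequence.from("liveness-probe", StandardCharsets.UTF_8);

            while (true) {
                try {
                    // A short timeout on the future reports "cluster unreachable" within
                    // seconds, independently of what the Watch stream does.
                    client.getKVClient().get(probeKey).get(2, TimeUnit.SECONDS);
                    System.out.println("etcd reachable");
                } catch (TimeoutException e) {
                    System.err.println("etcd did not answer within 2s - treating the cluster as unavailable");
                } catch (ExecutionException e) {
                    System.err.println("etcd request failed: " + e.getCause());
                }
                TimeUnit.SECONDS.sleep(5);
            }
        }
    }

The 2-second timeout and 5-second poll interval are arbitrary; pick values that match how quickly the service needs to notice an outage.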

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 22 (4 by maintainers)

Top GitHub Comments

2 reactions
sysadmind commented, Jun 13, 2019

I am having a similar issue. It is not clear to me how connection issues should be handled by a user of jetcd. I believe the onError handler in WatchImpl checks for a halt error or a no-leader error, neither of which matches when the cluster is entirely unavailable. This would be useful to propagate to the caller, because the caller may need to handle the loss of connectivity and act accordingly.

I’m not sure why stream.onCompleted() ends in a NullPointerException, but it appears to stop the reconnection attempts. This means the client will never reconnect after this error. Is there a way for a caller to check whether the connection is still being re-established? It may be beneficial for the caller to defer other work until the connection is established.
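For anyone hitting the same wall, one workaround is to stop relying on the client’s internal resume logic: treat onError as terminal, close the broken watcher, and re-create the watch after a backoff. This does not fix the NullPointerException inside WatchImpl; it only keeps the application reconnecting around it. A rough sketch under those assumptions (the class name, endpoint, and key are made up; it assumes the 0.3.0 listener-based Watch API that appears in the stack trace above):

    import java.nio.charset.StandardCharsets;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    import io.etcd.jetcd.ByteSequence;
    import io.etcd.jetcd.Client;
    import io.etcd.jetcd.Watch;
    import io.etcd.jetcd.watch.WatchResponse;

    public class ResilientWatch {

        private final Client client;
        private final ByteSequence key;
        private final ScheduledExecutorService retryPool = Executors.newSingleThreadScheduledExecutor();
        private volatile Watch.Watcher watcher;

        public ResilientWatch(Client client, ByteSequence key) {
            this.client = client;
            this.key = key;
        }

        /** (Re)establishes the watch; called once at startup and again after any stream error. */
        public synchronized void start() {
            if (watcher != null) {
                try {
                    watcher.close(); // release the broken stream before opening a new one
                } catch (Exception ignored) {
                    // the old stream is already broken
                }
            }
            watcher = client.getWatchClient().watch(key, new Watch.Listener() {
                @Override
                public void onNext(WatchResponse response) {
                    response.getEvents().forEach(event ->
                            System.out.println(event.getEventType() + " " + event.getKeyValue().getKey()));
                }

                @Override
                public void onError(Throwable throwable) {
                    // Treat every error as terminal and schedule our own re-watch,
                    // instead of depending on WatchImpl's internal resume behaviour.
                    System.err.println("watch failed: " + throwable);
                    retryLater();
                }

                @Override
                public void onCompleted() {
                    // The server closed the stream; re-establish it as well.
                    retryLater();
                }
            });
        }

        private void retryLater() {
            // Fixed 5-second backoff, scheduled off the gRPC callback thread;
            // real code would likely use exponential backoff with a cap.
            retryPool.schedule(this::start, 5, TimeUnit.SECONDS);
        }

        public static void main(String[] args) {
            // Placeholder endpoint and key.
            Client client = Client.builder().endpoints("http://etcd1:2379").build();
            new ResilientWatch(client, ByteSequence.from("config/", StandardCharsets.UTF_8)).start();
        }
    }

Combined with a liveness probe like the one sketched above, the service can both notice the outage quickly and resume watching once the cluster comes back.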

0 reactions
github-actions[bot] commented, Dec 3, 2022

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 7 days.

Read more comments on GitHub >

Top Results From Across the Web

Failure modes - etcd
When a leader fails, the etcd cluster automatically elects a new leader. The election does not happen instantly once the leader fails. It...
Read more >
Troubleshooting 'unable to connect to etcd' Error Message
This issue is caused by the Ondat daemonset pods not being able to connect to the etcd cluster. This is generally caused by...
Read more >
Etcd failing to setup cluster due to failure to find etcd local ...
I'm attempting to setup a cluster on Ubuntu 18.04 host machines. I'm getting the following error when using DNS for server discovery.
Read more >
Breaking down and fixing etcd cluster | by Andrei Kvapil
This article will be fully devoted to restoring an etcd-cluster. ... Failed to get the status of endpoint https://10.20.30.103:2379 (context ...
Read more >
Testing of Etcd Failure - Zhimin Wen - Medium
An odd number of nodes establishes a high-availability etcd cluster. When different numbers of nodes start to fail, ...
Read more >
