[BUG] Ignite Nodes disconnecting every now and then

Describe the bug

Our Ignite nodes on the Dev cluster disconnect occasionally. This has happened twice so far on May 19, 2020.

The topology snapshots show the server count dropping from 2 to 1 and back:

[23:03:23] Data Regions Configured:
[23:03:23]   ^-- default [initSize=256.0 MiB, maxSize=12.6 GiB, persistenceEnabled=false]
[23:03:36] Topology snapshot [ver=15376, servers=1, clients=17, CPUs=24, offheap=13.0GB, heap=22.0GB]
[23:03:36]   ^-- Node [id=D53D63B2-3AD3-47E9-B4CE-3E7B9EB2BB79, clusterState=ACTIVE]
[23:03:36] Data Regions Configured:
[23:03:36]   ^-- default [initSize=256.0 MiB, maxSize=12.6 GiB, persistenceEnabled=false]
[23:03:48] Topology snapshot [ver=15377, servers=2, clients=17, CPUs=32, offheap=25.0GB, heap=23.0GB]
[23:03:48]   ^-- Node [id=D53D63B2-3AD3-47E9-B4CE-3E7B9EB2BB79, clusterState=ACTIVE]
[23:03:48] Data Regions Configured:
[23:03:48]   ^-- default [initSize=256.0 MiB, maxSize=12.6 GiB, persistenceEnabled=false]
[23:03:48] Topology snapshot [ver=15378, servers=1, clients=17, CPUs=24, offheap=13.0GB, heap=22.0GB]
[23:03:48]   ^-- Node [id=D53D63B2-3AD3-47E9-B4CE-3E7B9EB2BB79, clusterState=ACTIVE]
[23:03:48] Data Regions Configured:
[23:03:48]   ^-- default [initSize=256.0 MiB, maxSize=12.6 GiB, persistenceEnabled=false]
[23:03:51] Topology snapshot [ver=15379, servers=2, clients=17, CPUs=32, offheap=25.0GB, heap=23.0GB]
[23:03:51]   ^-- Node [id=D53D63B2-3AD3-47E9-B4CE-3E7B9EB2BB79, clusterState=ACTIVE]
[23:03:51] Data Regions Configured:
[23:03:51]   ^-- default [initSize=256.0 MiB, maxSize=12.6 GiB, persistenceEnabled=false]
[23:04:07] Topology snapshot [ver=15380, servers=2, clients=16, CPUs=32, offheap=25.0GB, heap=23.0GB]
[23:04:07]   ^-- Node [id=D53D63B2-3AD3-47E9-B4CE-3E7B9EB2BB79, clusterState=ACTIVE]
[23:04:07] Data Regions Configured:
[23:04:07]   ^-- default [initSize=256.0 MiB, maxSize=12.6 GiB, persistenceEnabled=false]
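
The flapping is easier to see once the server counts are pulled out of the "Topology snapshot" lines. The following is a minimal Python sketch and not part of the original issue: the function name `server_count_changes` and the regular expression are assumptions based only on the log format shown above.

```python
import re

# Matches the "Topology snapshot" lines in the format shown in the log above.
SNAPSHOT_RE = re.compile(
    r"\[(?P<time>\d{2}:\d{2}:\d{2})\] Topology snapshot "
    r"\[ver=(?P<ver>\d+), servers=(?P<servers>\d+), clients=(?P<clients>\d+)"
)


def server_count_changes(log_text):
    """Return (time, version, previous, current) for every change in server count."""
    changes = []
    previous = None
    for match in SNAPSHOT_RE.finditer(log_text):
        current = int(match.group("servers"))
        if previous is not None and current != previous:
            changes.append(
                (match.group("time"), int(match.group("ver")), previous, current)
            )
        previous = current
    return changes


# Snapshot lines copied from the log above (other lines omitted for brevity).
sample = """\
[23:03:36] Topology snapshot [ver=15376, servers=1, clients=17, CPUs=24, offheap=13.0GB, heap=22.0GB]
[23:03:48] Topology snapshot [ver=15377, servers=2, clients=17, CPUs=32, offheap=25.0GB, heap=23.0GB]
[23:03:48] Topology snapshot [ver=15378, servers=1, clients=17, CPUs=24, offheap=13.0GB, heap=22.0GB]
[23:03:51] Topology snapshot [ver=15379, servers=2, clients=17, CPUs=32, offheap=25.0GB, heap=23.0GB]
[23:04:07] Topology snapshot [ver=15380, servers=2, clients=16, CPUs=32, offheap=25.0GB, heap=23.0GB]
"""

for time_of_day, version, before, after in server_count_changes(sample):
    print(f"{time_of_day} ver={version}: servers {before} -> {after}")
```

Run against a full node log, this kind of summary shows how often and how quickly a server leaves and rejoins the topology.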

[11:59:15,728][SEVERE][tcp-disco-ip-finder-cleaner-#5][TcpDiscoverySpi] Failed to clean IP finder up.
class org.apache.ignite.spi.IgniteSpiException: Failed to retrieve Ignite pods IP addresses.
at org.apache.ignite.spi.discovery.tcp.ipfinder.kubernetes.TcpDiscoveryKubernetesIpFinder.getRegisteredAddresses(TcpDiscoveryKubernetesIpFinder.java:172)
at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.registeredAddresses(TcpDiscoverySpi.java:1828)
at org.apache.ignite.spi.discovery.tcp.ServerImpl$IpFinderCleaner.cleanIpFinder(ServerImpl.java:1938)
at org.apache.ignite.spi.discovery.tcp.ServerImpl$IpFinderCleaner.body(ServerImpl.java:1913)
at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
Caused by: java.net.UnknownHostException: kubernetes.default.svc.cluster.local
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:673)
at sun.security.ssl.BaseSSLSocketImpl.connect(BaseSSLSocketImpl.java:173)
at sun.net.NetworkClient.doConnect(NetworkClient.java:180)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:463)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:558)
at sun.net.www.protocol.https.HttpsClient.<init>(HttpsClient.java:264)
at sun.net.www.protocol.https.HttpsClient.New(HttpsClient.java:367)
at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(AbstractDelegateHttpsURLConnection.java:191)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1156)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1050)
at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:177)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1564)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1492)
at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:263)
at org.apache.ignite.spi.discovery.tcp.ipfinder.kubernetes.TcpDiscoveryKubernetesIpFinder.getRegisteredAddresses(TcpDiscoveryKubernetesIpFinder.java:153)
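
The root cause in the trace above is a `java.net.UnknownHostException` for `kubernetes.default.svc.cluster.local`, which suggests the host name could not be resolved when the IP finder tried to reach the Kubernetes API. One way to check whether such a failure is intermittent is to resolve the name repeatedly from inside an affected pod. The helper below is a hypothetical Python sketch, not something taken from the issue: the name `resolve_with_retries` and its parameters are illustrative, and the resolver is injectable so the logic can be exercised without a cluster.

```python
import socket
import time


def resolve_with_retries(host, attempts=3, delay_seconds=1.0,
                         resolver=socket.gethostbyname, sleep=time.sleep):
    """Try to resolve `host` up to `attempts` times.

    Returns (ip_address_or_None, list_of_error_messages).
    """
    errors = []
    for attempt in range(1, attempts + 1):
        try:
            return resolver(host), errors
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            errors.append(f"attempt {attempt}: {exc}")
            if attempt < attempts:
                sleep(delay_seconds)
    return None, errors
```

Calling `resolve_with_retries("kubernetes.default.svc.cluster.local")` from inside a pod returns the resolved address together with any errors collected along the way. A non-empty error list alongside a successful result would point to flaky DNS rather than a permanently missing record.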

To Reproduce

Steps to reproduce the behavior:

  1. Execute epicli init … (with params).
  2. Edit the config file so that it includes PostgreSQL and Ignite.
  3. Execute epicli apply …

OS (please complete the following information):

  • OS: ???

Cloud Environment (please complete the following information):

  • Cloud Provider: MS Azure

Additional context

Issue based on Jira request: EP-108.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 8 (8 by maintainers)

Top GitHub Comments

1 reaction
toszo commented, Dec 8, 2020

This request is external, so I do not have information about the CNI used. I assume flannel (most use cases use flannel). The problem may be related to the K8s service discovery used in Ignite. @pyrkamarcin configured Ignite with Zookeeper and it worked without problems.

0 reactions
rafzei commented, Apr 26, 2021

@atsikham Could you please create a new gh issue to implement such functionality?
