Hazelcast issue after enabling Istio with injected Envoy sidecar
Hi,
I’m having an issue with Hazelcast after enabling Istio, and I wonder how I can address it.
I have a K8s cluster and recently installed Istio. When I add the Envoy sidecar to a deployment running Hazelcast, I see a strange issue with many connection errors during a rolling upgrade. I should mention that the deployment eventually completes OK, but these errors indicate something is wrong.
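For context, member discovery in this kind of setup typically looks something like the sketch below. This is only an illustrative configuration, not the reporter's actual one; the namespace and service name are hypothetical, and it assumes Hazelcast 4.x with the Kubernetes discovery plugin on the classpath:

```java
import com.hazelcast.config.Config;
import com.hazelcast.config.JoinConfig;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class MemberStarter {
    public static void main(String[] args) {
        Config config = new Config();

        // Disable multicast and use Kubernetes service discovery instead.
        JoinConfig join = config.getNetworkConfig().getJoin();
        join.getMulticastConfig().setEnabled(false);
        join.getKubernetesConfig().setEnabled(true)
            // Hypothetical names; adjust to your Deployment/Service.
            .setProperty("namespace", "default")
            .setProperty("service-name", "hazelcast");

        // Members listen on 5701, matching the log excerpts below.
        config.getNetworkConfig().setPort(5701);

        HazelcastInstance instance = Hazelcast.newHazelcastInstance(config);
    }
}
```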
I’ve noticed that without Envoy, when I perform a rolling upgrade of a deployment, I see the following:
[10.16.17.72]:5701 [dev] [4.0.1] Initialized new cluster connection between /10.16.17.72:45025 and /10.16.5.8:5701
[10.16.5.8]:5701 [dev] [4.0.1] Initialized new cluster connection between /10.16.5.8:5701 and /10.16.17.72:45025
[10.16.17.72]:5701 [dev] [4.0.1] Connection[id=1, /10.16.17.72:45025->/10.16.5.8:5701, qualifier=null, endpoint=[10.16.5.8]:5701, alive=false, connectionType=MEMBER] closed. Reason: Connection closed by the other side
[10.16.17.72]:5701 [dev] [4.0.1] Could not connect to: /10.16.5.8:5701. Reason: SocketException[Connection refused to address /10.16.5.8:5701]
.......
[10.16.17.72]:5701 [dev] [4.0.1] Removing connection to endpoint [10.16.5.8]:5701 Cause => java.net.SocketException {Connection refused to address /10.16.5.8:5701}, Error-Count: 5
[10.16.17.72]:5701 [dev] [4.0.1] Member [10.16.5.8]:5701 - 945ec2c8-fc56-4624-aab3-de9823d4886a is suspected to be dead for reason: No connection
What happens here is (see the listener sketch after this list):
- A new pod starts and joins the cluster.
- A connection is initialized between old-pod:5701 and new-pod:xxx (in both directions).
- The new pod complains that it cannot reach the old pod (connectionType=MEMBER) and, after 5 attempts, considers it dead and removes it from the cluster.
- The old pod is removed once the rolling upgrade completes.
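This join/removal sequence can also be observed programmatically. A minimal sketch, assuming the standard Hazelcast 4.x cluster listener API (the class name is made up):

```java
import com.hazelcast.cluster.MembershipEvent;
import com.hazelcast.cluster.MembershipListener;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class RollingUpgradeWatcher {
    public static void main(String[] args) {
        HazelcastInstance instance = Hazelcast.newHazelcastInstance();

        instance.getCluster().addMembershipListener(new MembershipListener() {
            @Override
            public void memberAdded(MembershipEvent event) {
                // Fires when the new pod joins during the rolling upgrade.
                System.out.println("Member joined: " + event.getMember().getAddress());
            }

            @Override
            public void memberRemoved(MembershipEvent event) {
                // Fires when the old pod is suspected dead and removed.
                System.out.println("Member removed: " + event.getMember().getAddress());
            }
        });
    }
}
```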
Now, when I do the same with Envoy injected, so each pod in the deployment has 2 containers, I notice the following:
[10.16.3.244]:5701 [dev] [4.0.1] Initialized new cluster connection between /10.16.3.244:5701 and **/127.0.0.6:48287**
[10.16.5.16]:5701 [dev] [4.0.1] Initialized new cluster connection between /10.16.5.16:59827 and /10.16.3.244:5701
[10.16.5.16]:5701 [dev] [4.0.1] Connection[id=1, /10.16.5.16:59827->/10.16.3.244:5701, qualifier=null, endpoint=[10.16.3.244]:5701, alive=false, connectionType=MEMBER] closed. Reason: Connection closed by the other side
But then I get millions of messages like the following:
[10.16.5.16]:5701 [dev] [4.0.1] Connection[id=2, /10.16.5.16:33659->/10.16.3.244:5701, qualifier=null, endpoint=[10.16.3.244]:5701, alive=false, connectionType=NONE] closed. Reason: Connection closed by the other side
The first ‘Connection closed’ message was of type MEMBER and referred to the same connection reported in the initialization message (10.16.5.16:59827 --> 10.16.3.244:5701), but the rest of the messages come from random ports on 10.16.5.16 to the old pod.
I assume the reason for this is the initialization message that reads Initialized new cluster connection between /10.16.3.244:5701 and **/127.0.0.6:48287**: the connection was registered against the loopback address instead of 10.16.5.16:59827.
The rolling upgrade completes just the same, but the log fills up with millions of messages of that kind.
How can I prevent this? How can I make sure Hazelcast records the connection between the pod IPs rather than against the loopback address?
Thanks, Chen
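One knob worth trying on the Hazelcast side is to pin the addresses a member binds and advertises. This is only a hedged sketch of the standard network configuration options, not a confirmed fix for the loopback entries above; the POD_IP environment variable is an assumption (it would have to be injected, e.g. via the Kubernetes Downward API):

```java
import com.hazelcast.config.Config;
import com.hazelcast.config.NetworkConfig;
import com.hazelcast.core.Hazelcast;

public class PinnedAddressMember {
    public static void main(String[] args) {
        Config config = new Config();
        NetworkConfig network = config.getNetworkConfig();

        // Restrict binding to the pod network instead of all interfaces.
        // 10.16.*.* matches the pod CIDR seen in the logs; adjust as needed.
        network.getInterfaces().setEnabled(true).addInterface("10.16.*.*");

        // Advertise the pod IP to other members.
        String podIp = System.getenv("POD_IP"); // hypothetical env var name
        if (podIp != null) {
            network.setPublicAddress(podIp + ":5701");
        }

        Hazelcast.newHazelcastInstance(config);
    }
}
```

At the Istio level, a common alternative is to bypass sidecar interception for the cluster port altogether using the traffic.sidecar.istio.io/excludeInboundPorts and traffic.sidecar.istio.io/excludeOutboundPorts pod annotations.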
@lechen26, I checked Hazelcast internals and identified a bug. Please follow it up here: https://github.com/hazelcast/hazelcast/issues/18320
In my testing, the issue doesn’t cause any problems with Hazelcast’s cluster formation or regular operations, but it needs to be fixed, since a member opens unnecessary connections when running behind a proxy.
Thanks for your contribution, and for your patience.
I can verify that this issue is fixed by https://github.com/hazelcast/hazelcast/issues/18320.
However, when using Hazelcast behind a proxy, users can still face some connection errors on member disconnection. This is caused by the startup latencies of the proxy and Hazelcast containers: the proxy (Envoy) starts listening on the port and accepting TCP connections even when the Hazelcast instance is not yet ready, which makes other Hazelcast members connect and disconnect repeatedly until the Hazelcast instance is up and running. This is expected and does not directly affect Hazelcast’s lifecycle.
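One way to soften that startup race is to delay Hazelcast startup until the sidecar reports ready. A minimal sketch, assuming Istio’s agent readiness endpoint at http://localhost:15021/healthz/ready (the port differs across Istio versions) and Java 11+:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

import com.hazelcast.core.Hazelcast;

public class SidecarAwareStarter {
    public static void main(String[] args) throws InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest probe = HttpRequest.newBuilder(
                URI.create("http://localhost:15021/healthz/ready")) // assumed agent port
            .GET()
            .build();

        // Poll until Envoy is ready, so other members don't connect through a
        // half-started proxy and churn connections.
        while (true) {
            try {
                HttpResponse<Void> response =
                    client.send(probe, HttpResponse.BodyHandlers.discarding());
                if (response.statusCode() == 200) {
                    break;
                }
            } catch (Exception ignored) {
                // Sidecar not listening yet.
            }
            Thread.sleep(500);
        }

        Hazelcast.newHazelcastInstance();
    }
}
```

Newer Istio versions offer the holdApplicationUntilProxyStarts proxy option for the same effect without application changes.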
Closing this one as the main issue is fixed.