ConnectionWatchdog tries to reconnect to the node's previous IP
Bug Report
Lettuce's ConnectionWatchdog keeps trying to connect to old IPs of Redis nodes.
Current Behavior
- Cluster with 6 nodes (3 master nodes, each with one replica), running under Kubernetes
- Redis nodes are restarted one by one, each getting a new IP address
- The ConnectionWatchdog doesn't seem to use this information to stop connecting to the previous IPs.
RoundRobinSocketAddressSupplier says that the IP address of 3da56d06b4a34c20ee560d3ed28a2679ba089a30 is 10.6.21.237, but RedisStateMachine has it as 10.6.37.76.
Log messages
09:08:44.918 DEBUG RedisStateMachine Decoded LatencyMeteredCommand [type=CLUSTER, output=StatusOutput [output=
ff06a774b13ad88a63b39dcd0c9325caf3e4fa16 10.6.34.87:6379@16379 master - 0 1594717721000 90 connected 10923-16383
72641dc0aa44972fd1ffd31ad1342cbcfd01b4fb 10.6.26.31:6379@16379 myself,slave ff06a774b13ad88a63b39dcd0c9325caf3e4fa16 0 1594717722000 84 connected
9bc0e2cb46f424cac90d8816485d1a2728919765 10.6.13.141:6379@16379 slave b4cb4eb98ca653e888d3b2ec931898ab7c97b867 0 1594717723043 92 connected
b4cb4eb98ca653e888d3b2ec931898ab7c97b867 10.6.21.140:6379@16379 master - 0 1594717722039 92 connected 0-5460
fbc6db4e984a9142efd0c5c7ab01b2f21abb0787 10.6.11.155:6379@16379 master - 0 1594717721036 86 connected 5461-10922
3da56d06b4a34c20ee560d3ed28a2679ba089a30 10.6.37.76:6379@16379 slave fbc6db4e984a9142efd0c5c7ab01b2f21abb0787 0 1594717724047 86 connected
, error='null'], commandType=io.lettuce.core.cluster.topology.TimedAsyncCommand], empty stack: true
09:09:08.805 DEBUG RoundRobinSocketAddressSupplier Resolved SocketAddress 10.6.21.237:6379 using for Cluster node 3da56d06b4a34c20ee560d3ed28a2679ba089a30
09:09:08.806 DEBUG ReconnectionHandler Reconnecting to Redis at 10.6.21.237:6379
09:09:08.844 DEBUG RedisStateMachine Decoded LatencyMeteredCommand [type=CLUSTER, output=StatusOutput [output=
ff06a774b13ad88a63b39dcd0c9325caf3e4fa16 10.6.34.87:6379@16379 master - 0 1594717747000 90 connected 10923-16383
9bc0e2cb46f424cac90d8816485d1a2728919765 10.6.13.141:6379@16379 slave b4cb4eb98ca653e888d3b2ec931898ab7c97b867 0 1594717745384 92 connected
3da56d06b4a34c20ee560d3ed28a2679ba089a30 10.6.37.76:6379@16379 slave fbc6db4e984a9142efd0c5c7ab01b2f21abb0787 0 1594717747390 86 connected
fbc6db4e984a9142efd0c5c7ab01b2f21abb0787 10.6.11.155:6379@16379 myself,master - 0 1594717744000 86 connected 5461-10922
b4cb4eb98ca653e888d3b2ec931898ab7c97b867 10.6.21.140:6379@16379 master - 0 1594717748401 92 connected 0-5460
72641dc0aa44972fd1ffd31ad1342cbcfd01b4fb 10.6.26.31:6379@16379 slave ff06a774b13ad88a63b39dcd0c9325caf3e4fa16 0 1594717747000 90 connected
, error='null'], commandType=io.lettuce.core.cluster.topology.TimedAsyncCommand], empty stack: true
09:09:18.824 DEBUG ConnectionWatchdog [channel=0x03ba1129, /10.6.13.138:40214 -> /10.6.21.237:6379, last known addr=/10.6.21.237:6379] scheduleReconnect()
09:09:18.824 DEBUG ConnectionWatchdog Cannot reconnect to [10.6.21.237:6379]: connection timed out: /10.6.21.237:6379 io.netty.channel.ConnectTimeoutException: connection timed out: /10.6.21.237:6379
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:261)
at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:834)
09:09:18.825 DEBUG ConnectionWatchdog [channel=0x03ba1129, /10.6.13.138:40214 -> /10.6.21.237:6379, last known addr=/10.6.21.237:6379] Reconnect attempt 61, delay 30000ms
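The mismatch visible in the log above (the topology reports 10.6.37.76 for node 3da56d06..., while the reconnect targets 10.6.21.237) can be checked mechanically. The following is a minimal sketch, assuming the raw CLUSTER NODES output is available as a string; the class ClusterNodesAddresses and its methods are hypothetical helpers written for illustration and are not part of Lettuce.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not a Lettuce class): extracts node ID -> "ip:port"
// from raw CLUSTER NODES output and flags a stale reconnect target.
class ClusterNodesAddresses {

    /** Parses each CLUSTER NODES line into nodeId -> "ip:port" (cluster bus port stripped). */
    static Map<String, String> parse(String clusterNodesOutput) {
        Map<String, String> addresses = new LinkedHashMap<>();
        for (String line : clusterNodesOutput.split("\n")) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length < 2) {
                continue; // skip blank or malformed lines
            }
            String address = fields[1];
            int at = address.indexOf('@'); // "ip:port@cport" -> "ip:port"
            if (at >= 0) {
                address = address.substring(0, at);
            }
            addresses.put(fields[0], address);
        }
        return addresses;
    }

    /** True if the node is known and its current address differs from the one being dialed. */
    static boolean isStale(Map<String, String> topology, String nodeId, String resolvedAddress) {
        String current = topology.get(nodeId);
        return current != null && !current.equals(resolvedAddress);
    }

    public static void main(String[] args) {
        // Two lines taken from the log output in this report.
        String output =
            "3da56d06b4a34c20ee560d3ed28a2679ba089a30 10.6.37.76:6379@16379 slave "
                + "fbc6db4e984a9142efd0c5c7ab01b2f21abb0787 0 1594717724047 86 connected\n"
            + "fbc6db4e984a9142efd0c5c7ab01b2f21abb0787 10.6.11.155:6379@16379 master "
                + "- 0 1594717721036 86 connected 5461-10922\n";
        Map<String, String> topology = parse(output);
        // The address ReconnectionHandler logged for this node.
        System.out.println(isStale(topology,
            "3da56d06b4a34c20ee560d3ed28a2679ba089a30", "10.6.21.237:6379")); // prints true
    }
}
```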
Lettuce Configuration
Relevant Spring Boot Configuration
spring.redis.lettuce.cluster.refresh.adaptive=true
spring.redis.lettuce.cluster.refresh.period=1M
custom.redis.lettuce.cluster.refresh.dynamic-sources=false (dynamicRefreshSources)
spring.redis.cluster.nodes=\
redis-cluster-0.redis-cluster:6379,\
redis-cluster-1.redis-cluster:6379,\
redis-cluster-2.redis-cluster:6379,\
redis-cluster-3.redis-cluster:6379,\
redis-cluster-4.redis-cluster:6379,\
redis-cluster-5.redis-cluster:6379
Expected behavior/code
ConnectionWatchdog should stop trying to connect to a previous IP address of a Redis node (which is now known to have another IP address).
Environment
- Lettuce version(s): 5.3.1.RELEASE
- Redis version: 6.0.1
- Spring Boot: 2.3.1
Issue Analytics
- State:
- Created 3 years ago
- Comments:5 (2 by maintainers)
Top GitHub Comments
All DNS resolution is handled by SocketAddressResolver and DnsResolver. After looking into RoundRobinSocketAddressSupplier, it seems that DNS resolution isn't involved at all as RoundRobinSocketAddressSupplier is based on the initial Partitions object. Whenever a reconnect occurs, RoundRobinSocketAddressSupplier is asked to provide a new endpoint to connect to. If the Partitions change in between calls to RoundRobinSocketAddressSupplier.get(), RoundRobin is rebuilt.

For some reason, this doesn't happen in your case. If you are able to reproduce the issue, please step into RoundRobinSocketAddressSupplier.get() to capture the state of partitions and the inner state of RoundRobin to see where the mismatch stems from.

I am facing the same exception as well, with almost all default configuration with Lettuce. This only happens with a larger data set.
These are the configs:
This works with a smaller data set and always fails on a large data set.
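
For reference, the behaviour described in the first comment (RoundRobin is rebuilt when the Partitions change between calls to get()) can be sketched as follows. This is a minimal, self-contained illustration and not Lettuce's actual implementation: the class RebuildingRoundRobin and its topology supplier are hypothetical names, and node addresses are modelled as plain strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical illustration (not Lettuce code): a round-robin address supplier
// that rebuilds its rotation whenever the known node list has changed.
class RebuildingRoundRobin implements Supplier<String> {

    private final Supplier<List<String>> topology; // assumed source of current node addresses
    private List<String> snapshot = new ArrayList<>();
    private int next = 0;

    RebuildingRoundRobin(Supplier<List<String>> topology) {
        this.topology = topology;
    }

    @Override
    public synchronized String get() {
        List<String> current = topology.get();
        if (!current.equals(snapshot)) {
            // Topology changed since the last call: take a fresh copy and restart the rotation.
            snapshot = new ArrayList<>(current);
            next = 0;
        }
        if (snapshot.isEmpty()) {
            throw new IllegalStateException("no known nodes");
        }
        String address = snapshot.get(next);
        next = (next + 1) % snapshot.size();
        return address;
    }

    public static void main(String[] args) {
        List<String> nodes = new ArrayList<>(List.of("10.6.21.237:6379", "10.6.34.87:6379"));
        RebuildingRoundRobin supplier = new RebuildingRoundRobin(() -> nodes);

        System.out.println(supplier.get()); // 10.6.21.237:6379 (old address)
        nodes.set(0, "10.6.37.76:6379");    // the node restarts with a new IP
        System.out.println(supplier.get()); // 10.6.37.76:6379 (rotation rebuilt)
    }
}
```

In this sketch a stale address can only be returned if the topology source itself still reports it, which is why capturing the state of partitions inside get(), as suggested above, is the useful diagnostic step.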