Fall back to initial seed nodes on topology refresh when dynamicRefreshSources is enabled
Bug Report
Current Behavior
In Issue #338 I mentioned we experienced a problem around the use of AWS Elasticache and Lettuce not following DNS changes.
I eventually tracked this down to the way dynamicRefreshSources is implemented. When set to true (the default), the initial seed nodes are resolved on startup and connections are established. From that point onwards the DNS entries are never re-resolved, so if the entire cluster changes the application loses connectivity permanently (until restart).
When dealing with ElastiCache, most likely only individual nodes change most of the time. However, when creating a cluster it’s necessary to pick a maintenance window during which the cluster may not be available, and I think it’s entirely possible that after this window all of the underlying VMs have changed.
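For context, the core of the problem is that a seed hostname is resolved once and only the IP is kept afterwards. Re-resolving on demand is cheap with the plain JDK, as the sketch below shows (an illustrative helper, not Lettuce code; note that the JVM’s own `networkaddress.cache.ttl` security property also affects how long successful lookups are cached):

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

class DnsRefresh {

    // Re-resolve the seed hostname on every call instead of caching the
    // first answer. (Illustrative helper only, not part of Lettuce.)
    static String[] resolve(String hostname) throws UnknownHostException {
        InetAddress[] addresses = InetAddress.getAllByName(hostname);
        String[] ips = new String[addresses.length];
        for (int i = 0; i < addresses.length; i++) {
            ips[i] = addresses[i].getHostAddress();
        }
        return ips;
    }
}
```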
Input Code
This is the way we connect, which exhibits the problem. To reproduce easily (without waiting for a maintenance window, etc.), delete the ElastiCache cluster and create a new one with the same name; all of the underlying IP addresses will then change.
```java
import java.time.Duration;
import io.lettuce.core.ClientOptions;
import io.lettuce.core.cluster.ClusterClientOptions;
import io.lettuce.core.cluster.ClusterTopologyRefreshOptions;
import org.springframework.data.redis.connection.RedisClusterConfiguration;
import org.springframework.data.redis.connection.lettuce.LettuceClientConfiguration;

ClusterTopologyRefreshOptions topologyRefreshOptions = ClusterTopologyRefreshOptions.builder()
        .enablePeriodicRefresh(Duration.ofSeconds(30))
        .dynamicRefreshSources(true)
        .enableAllAdaptiveRefreshTriggers()
        .build();

ClientOptions clientOptions = ClusterClientOptions.builder()
        .topologyRefreshOptions(topologyRefreshOptions)
        .build();

RedisClusterConfiguration redisClusterConfiguration = new RedisClusterConfiguration(clusterNodes);
LettuceClientConfiguration lettuceClientConfiguration = LettuceClientConfiguration.builder()
        .clientOptions(clientOptions).build();
```
The Lettuce implementation of dynamicRefreshSources can be seen here https://github.com/lettuce-io/lettuce-core/blob/master/src/main/java/io/lettuce/core/cluster/RedisClusterClient.java#L1052
Setting dynamicRefreshSources to false fixes the problem for us because Lettuce then goes back to the initialUris and re-resolves their hostnames.
Environment
- Lettuce version: 5.0.4.RELEASE
- Redis version: 3.2.6 (AWS Elasticache)
- Spring boot: 2.0.2.RELEASE
Possible Solution
I don’t know if this issue is a bug exactly, but at a minimum it would be helpful to update the documentation so the impact of this setting on DNS is more obvious. The documentation does reference DNS in other areas and allows you to configure custom DNS resolvers, but that has no effect here because no attempt is being made to resolve DNS at all.
For an actual code fix, the only thing I can think of is to re-resolve the initialUris when all hosts are marked as down, or to offer that as a setting, since the current workaround means all connections are thrown away on every refresh.
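To make the proposed fix concrete, here is a minimal sketch of the selection logic (a hypothetical helper, not the actual RedisClusterClient code): prefer the dynamically discovered nodes as refresh sources, but fall back to the initial seed URIs when none of the discovered nodes is reachable.

```java
import java.util.List;

class RefreshSourceSelector {

    // Hypothetical fallback rule: prefer dynamically discovered nodes,
    // but return the initial seed URIs when every discovered node is
    // marked as down (or nothing has been discovered yet). The seed
    // hostnames can then be re-resolved via DNS.
    static List<String> refreshSources(List<String> discovered,
                                       List<String> downNodes,
                                       List<String> initialSeeds) {
        boolean allDown = downNodes.containsAll(discovered);
        if (discovered.isEmpty() || allDown) {
            return initialSeeds;
        }
        return discovered;
    }
}
```

With a rule like this the dynamic behaviour is kept for the common case (individual node replacement), while a full cluster replacement eventually routes topology refresh back through the seed hostnames.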
Issue Analytics
- Created 5 years ago
- Comments:12 (6 by maintainers)
Top GitHub Comments
I’ve created a small hello world app based on the spring boot starter application here https://github.com/stuartharper/gs-spring-boot
The application can be launched via initial/gradlew bootRun
Accessing localhost:8080 causes the application to write the key “Hello” with the value “World” into the configured Redis cluster, and then immediately read it back. io.lettuce logging is set to debug so the connection details can be observed.
Redis is configured via initial/src/main/resources/application.properties: redis.hosts is the server:port to connect to and redis.dynamicRefreshSources controls the dynamicRefreshSources behaviour.
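For reference, a minimal application.properties for the test app might look like this (the property names are the ones described above; the host value is a placeholder):

```properties
# server:port seed to connect to (placeholder value)
redis.hosts=my-cluster.example.cache.amazonaws.com:6379
# true reproduces the problem; false is the workaround
redis.dynamicRefreshSources=true
```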
Scenarios I tested:
Created an AWS ElastiCache cluster (clustered mode, engine 3.2.6, 1 shard with 2 replicas, multi-AZ) and allowed the application to connect to it and write a value. While the application remained running, deleted the cluster and recreated it with the same name.
Two separate clusters of standard Redis 3.2.9 with 3 nodes each. Using the local hosts file I simulated a DNS change from one cluster to the other.
Results:
With redis.dynamicRefreshSources set to true the application continues trying to access the IP address of the original ElastiCache cluster even after the local DNS entry (checked via ping) has been updated to the new IP. I left it running overnight and the application connection remained broken.
With redis.dynamicRefreshSources set to false the application reconnects to the updated cluster around the same time as local ping returns the new IP.
With redis.dynamicRefreshSources set to true the application connects to the IP contained in the hosts file and never switches to the second cluster even when the first is completely stopped.
With redis.dynamicRefreshSources set to false the application reconnected to the updated cluster on the next refresh interval.
Our main concern is the AWS ElastiCache maintenance window, during which multiple nodes may be replaced simultaneously. The docs say they try not to replace too many at once, but no guarantees are given: https://aws.amazon.com/elasticache/elasticache-maintenance/
The concern is that if the application loses its connection to the cluster, it will never re-establish it until it’s restarted.
That’s fixed now.