Fall back to initial seed nodes on topology refresh when dynamicRefreshSources is enabled
Bug Report
Current Behavior
In Issue #338 I mentioned we experienced a problem around the use of AWS Elasticache and Lettuce not following DNS changes.
I eventually tracked this down to the way dynamicRefreshSources is implemented. When set to true (the default), the initial seed nodes are resolved on startup and connections are established. From that point onwards the DNS entries are never re-resolved, so if the entire cluster changes the application loses connectivity permanently (until restart).
When dealing with ElastiCache, most likely only individual nodes change most of the time. However, when creating a cluster it’s necessary to pick a maintenance window during which the cluster may not be available, and I think it’s entirely possible that after this window all of the underlying VMs have changed.
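For context, the core of the problem is that a seed hostname is resolved once and only the IP is kept afterwards. Re-resolving on demand is cheap with the plain JDK, as the sketch below shows (an illustrative helper, not Lettuce code; note that the JVM’s own `networkaddress.cache.ttl` security property also affects how long successful lookups are cached):

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

class DnsRefresh {

    // Re-resolve the seed hostname on every call instead of caching the
    // first answer. (Illustrative helper only, not part of Lettuce.)
    static String[] resolve(String hostname) throws UnknownHostException {
        InetAddress[] addresses = InetAddress.getAllByName(hostname);
        String[] ips = new String[addresses.length];
        for (int i = 0; i < addresses.length; i++) {
            ips[i] = addresses[i].getHostAddress();
        }
        return ips;
    }
}
```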
Input Code
This is the way we connect, which exhibits the problem. To reproduce easily (without waiting for a maintenance window, etc.), delete the ElastiCache cluster and create a new one with the same name; all of the underlying IP addresses will then change.
```java
import java.time.Duration;
import io.lettuce.core.ClientOptions;
import io.lettuce.core.cluster.ClusterClientOptions;
import io.lettuce.core.cluster.ClusterTopologyRefreshOptions;
import org.springframework.data.redis.connection.RedisClusterConfiguration;
import org.springframework.data.redis.connection.lettuce.LettuceClientConfiguration;

ClusterTopologyRefreshOptions topologyRefreshOptions = ClusterTopologyRefreshOptions.builder()
        .enablePeriodicRefresh(Duration.ofSeconds(30))
        .dynamicRefreshSources(true)
        .enableAllAdaptiveRefreshTriggers()
        .build();

ClientOptions clientOptions = ClusterClientOptions.builder()
        .topologyRefreshOptions(topologyRefreshOptions)
        .build();

RedisClusterConfiguration redisClusterConfiguration = new RedisClusterConfiguration(clusterNodes);
LettuceClientConfiguration lettuceClientConfiguration = LettuceClientConfiguration.builder()
        .clientOptions(clientOptions).build();
```
The Lettuce implementation of dynamicRefreshSources can be seen here https://github.com/lettuce-io/lettuce-core/blob/master/src/main/java/io/lettuce/core/cluster/RedisClusterClient.java#L1052
Setting dynamicRefreshSources to false fixes the problem for us because Lettuce then goes back to the initialUris and re-resolves their hostnames.
Environment
- Lettuce version: 5.0.4.RELEASE
- Redis version: 3.2.6 (AWS Elasticache)
- Spring boot: 2.0.2.RELEASE
Possible Solution
I don’t know if this issue is a bug exactly, but at a minimum it would be helpful to update the documentation so the impact of this setting on DNS is more obvious. The documentation does reference DNS in other areas and allows you to configure custom DNS resolvers, but that has no effect here because no attempt is being made to resolve DNS at all.
For an actual code fix, the only thing I can think of is to re-resolve the initialUris when all hosts are marked as down, or to offer that as a setting, since the current workaround means all connections are thrown away on every refresh.
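To make the proposed fix concrete, here is a minimal sketch of the selection logic (a hypothetical helper, not the actual RedisClusterClient code): prefer the dynamically discovered nodes as refresh sources, but fall back to the initial seed URIs when none of the discovered nodes is reachable.

```java
import java.util.List;

class RefreshSourceSelector {

    // Hypothetical fallback rule: prefer dynamically discovered nodes,
    // but return the initial seed URIs when every discovered node is
    // marked as down (or nothing has been discovered yet). The seed
    // hostnames can then be re-resolved via DNS.
    static List<String> refreshSources(List<String> discovered,
                                       List<String> downNodes,
                                       List<String> initialSeeds) {
        boolean allDown = downNodes.containsAll(discovered);
        if (discovered.isEmpty() || allDown) {
            return initialSeeds;
        }
        return discovered;
    }
}
```

With a rule like this the dynamic behaviour is kept for the common case (individual node replacement), while a full cluster replacement eventually routes topology refresh back through the seed hostnames.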
Issue Analytics
- Created 5 years ago
- Comments:12 (6 by maintainers)
Top GitHub Comments
I’ve created a small hello world app based on the spring boot starter application here https://github.com/stuartharper/gs-spring-boot
The application can be launched via initial/gradlew bootRun
Accessing localhost:8080 causes the application to write the key “Hello” with the value “World” into the configured Redis cluster, and then immediately read it back. io.lettuce logging is set to debug so the connection details can be observed.
Redis is configured via initial/src/main/resources/application.properties: redis.hosts is the server:port to connect to and redis.dynamicRefreshSources controls the dynamicRefreshSources behaviour.
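For reference, a minimal application.properties for the test app might look like this (the property names are the ones described above; the host value is a placeholder):

```properties
# server:port seed to connect to (placeholder value)
redis.hosts=my-cluster.example.cache.amazonaws.com:6379
# true reproduces the problem; false is the workaround
redis.dynamicRefreshSources=true
```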
Scenarios I tested:
Created an AWS ElastiCache cluster (clustered mode, engine 3.2.6, 1 shard with 2 replicas, multi-AZ) and allowed the application to connect to it and write a value. While the application remained running, deleted the cluster and recreated it with the same name.
Two separate clusters of standard Redis 3.2.9 with 3 nodes each. Using the local hosts file I simulated a DNS change from one cluster to the other.
Results:
With redis.dynamicRefreshSources set to true the application continues trying to access the IP address of the original ElastiCache cluster even after the local DNS entry (checked via ping) has been updated to the new IP. I left it running overnight and the application connection remained broken.
With redis.dynamicRefreshSources set to false the application reconnects to the updated cluster around the same time as local ping returns the new IP.
With redis.dynamicRefreshSources set to true the application connects to the IP contained in the hosts file and never switches to the second cluster even when the first is completely stopped.
With redis.dynamicRefreshSources set to false the application reconnected to the updated cluster on the next refresh interval.
Our main concern is the AWS ElastiCache maintenance window, during which multiple nodes may be replaced simultaneously. The docs say they try not to replace too many at once, but no guarantees are given: https://aws.amazon.com/elasticache/elasticache-maintenance/
The concern is that if the application loses its connection to the cluster, it will never re-establish it until it’s restarted.
That’s fixed now.