localCachedMap pubsub/updates broken after elasticache cluster maintenance or failover test


Expected behavior

Updates to RMap data that is backed by a local cache should propagate to all nodes, including their local caches, while the cluster topology changes. When AWS ElastiCache maintenance or a failover occurs and nodes switch between master and slave, RLocalCachedMap pubsub subscriptions should be re-established, and updates to local cache values should continue to propagate across all Redisson instances.

AWS elasticache cluster uptime is maintained throughout all maintenance operations: redisson should follow topology changes, node type changes (master/slave) and re-establish pubsub connections for RLocalCachedMaps to continue to keep data synchronized with local caches.

Actual behavior

RLocalCachedMap local cache values are no longer updated on remote Redisson clients once cluster maintenance has been performed. Later changes to an RLocalCachedMap can remain stale in the local cache of any Redisson node that did not originate the change, even when the change is made after the cluster topology has fully recovered. The result is stale or differing values for the same keys across nodes that use local caches.

Steps to reproduce or test case

  1. Create elasticache redis cluster with cluster mode enabled, multiAZ enabled
  2. Create and connect 2 instances of Redisson, and create common RLocalCachedMap
  3. Trigger aws elasticache maintenance/failover test: aws elasticache test-failover --replication-group-id dev-generic-failtest --node-group-id 0001

Redis version 6.2.6, 6.0.5

Redisson version 3.17.7

Redisson configuration

  • Using a TLS/SSL connection via “rediss://” connection string to the AWS elasticache configuration endpoint

  • We are also using a nameMapper:

      Config config = new Config();
      ClusterServersConfig clusterServers = config.useClusterServers()
              .setRetryInterval(3000)
              .setTimeout(30000)
              .setReadMode(ReadMode.MASTER_SLAVE)
              .setNameMapper(new NameMapper() {
                  @Override
                  public String map(String name) {
                      return KEY_PREFIX_COLON + name;
                  }
    
                  @Override
                  public String unmap(String name) {
                      return name.replace(KEY_PREFIX_COLON, "");
                  }
              });
    


  • Default values for everything else.
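One detail of the nameMapper above is worth noting, independent of whether it has anything to do with the pubsub problem: `name.replace(KEY_PREFIX_COLON, "")` removes every occurrence of the prefix, not only the leading one, so a name that happens to contain the prefix text elsewhere would not round-trip. The following is a minimal plain-Java sketch of a prefix-only variant. The class name `PrefixNames` and the prefix value `"myapp:"` are assumptions for illustration; the issue does not show the real value of `KEY_PREFIX_COLON`.

```java
// Plain-Java sketch of the prefix logic only. In the real configuration these
// two methods would be the bodies of NameMapper.map and NameMapper.unmap.
final class PrefixNames {
    // Hypothetical value: the issue does not show what KEY_PREFIX_COLON contains.
    static final String KEY_PREFIX_COLON = "myapp:";

    private PrefixNames() {
    }

    // Prepend the prefix to a logical name.
    static String map(String name) {
        return KEY_PREFIX_COLON + name;
    }

    // Strip the prefix only when it is at the start of the name, so that
    // later occurrences of the same text are left untouched.
    static String unmap(String name) {
        return name.startsWith(KEY_PREFIX_COLON)
                ? name.substring(KEY_PREFIX_COLON.length())
                : name;
    }
}
```

With this variant, `unmap(map(name))` returns the original name even when the name itself contains the prefix text.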

I created a simple test application to confirm and reproduce this issue. I can publish the entire repo if desired; here is the pertinent code:

MainApp() {
    LOG.info("Starting up test...");

    int count = -1;
    RLocalCachedMap<Object, Object> testMap = redissonConnection.getRedisson()
            .getLocalCachedMap("cachedmap", LocalCachedMapOptions.defaults().syncStrategy(LocalCachedMapOptions.SyncStrategy.UPDATE));
    testMap.preloadCache();
    String shortId = redissonConnection.getRedisson().getId().substring(0, 7);

    testMap.put(shortId + "_increasing", ++count);
    testMap.put(shortId + "_timestamp", System.currentTimeMillis());
    Map<Object, Object> localCachedMap = testMap.getCachedMap();

    LOG.info("starting main loop in " + this.getClass().getName());
    Thread printingHook = new Thread(testMap::clear);
    Runtime.getRuntime().addShutdownHook(printingHook);

    while (true) {
        try {
            Thread.sleep(5000);
            LOG.info("mynode: {}, current time: {} local cache:", shortId, System.currentTimeMillis());
            localCachedMap.forEach((key, value) -> LOG.info("{}: {}", key, value));

            LOG.info("mynode: {}, current time: {} real cache:", shortId, System.currentTimeMillis());
            testMap.forEach((key, value) -> LOG.info("{}: {}", key, value));

            testMap.put(shortId + "_increasing", ++count);
            testMap.put(shortId + "_timestamp", System.currentTimeMillis());
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}


With this test, I run 2 copies of the same application. In the logs you can see that after the test-failover occurs, the Redisson instances stop receiving new values in their local cache for entries that originated on the other node.
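Rather than comparing the two log sections by eye, the divergence can be checked programmatically. The sketch below is not part of the original test application; `StaleEntryFinder` and `findStaleKeys` are hypothetical names. It takes two plain maps and reports the keys whose local value differs from the remote value, or which are missing locally.

```java
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not taken from the issue: compares a snapshot of the
// local cache with a snapshot of the values read from Redis.
final class StaleEntryFinder {
    private StaleEntryFinder() {
    }

    // Returns the keys (as strings, sorted) whose value in localView is
    // missing or differs from the value in remoteView.
    static Set<String> findStaleKeys(Map<?, ?> localView, Map<?, ?> remoteView) {
        Set<String> stale = new TreeSet<>();
        for (Map.Entry<?, ?> remote : remoteView.entrySet()) {
            Object local = localView.get(remote.getKey());
            if (!Objects.equals(local, remote.getValue())) {
                stale.add(String.valueOf(remote.getKey()));
            }
        }
        return stale;
    }
}
```

In the test loop above, the local view would be `localCachedMap` (from `testMap.getCachedMap()`), and the remote view could be collected the same way the test logs the "real cache", by iterating `testMap` into a `HashMap`. Because each node writes its own keys every 5 seconds, a key that shows up as stale once may simply be in flight; a key that stays stale across several iterations matches the behavior described here.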

I have enabled TRACE logging for org.redisson and have included the logs generated for both of these instances (same code running in 2 different JVMs). app02.log app01.log

Here is a pertinent section of app02.log. Notice how the local cache for the data originating at the remote node is old and stale. You can see the local cache staying correct on both nodes before the failover test, and then it stops updating during the failover test.

[image: app02.log excerpt]

Issue Analytics

  • State: closed
  • Created a year ago
  • Reactions:1
  • Comments:8 (4 by maintainers)

Top GitHub Comments

mrniko commented, Nov 22, 2022 (1 reaction)

Please try attached version.

redisson-3.18.1-SNAPSHOT.jar.zip

mrniko commented, Nov 22, 2022 (1 reaction)

@servionsolutions

Thank you for the analysis.

Can you ask the AWS team to comment on this case?


