
Improve log message for nodes that cannot be reached during reconnect/topology refresh

See original GitHub issue

Bug Report

Current Behavior

I’m using AWS ElastiCache with three shards, each with one master and two replicas. I failed over a master, and the slave correctly promoted itself to master for that shard, but Lettuce kept failing writes until the old master was connectable again (see below; at first the old master wasn’t even usable, since AUTH commands failed). I have periodic topology refresh set up, which I expected to remedy this, but it clearly doesn’t: I kept getting failures for another three minutes, again until the old master host came back up.

Some time later, part of my code (by design) destroyed the old pool, created a new one, and reported the cluster state as follows (the notable part is `flags=[MASTER, FAIL]` on the last line):

node:RedisClusterNodeSnapshot [uri=RedisURI [host='NAME-0001-003.NAME.BLAH.cache.amazonaws.com', port=6379], nodeId='f5bf8d2f70ad6d478cd61b1050059a19ee027406', connected=true, slaveOf='9b9e0fc92afdc39c3b20fc1aceb2825529093f9a', pingSentTimestamp=0, pongReceivedTimestamp=1571319108972, configEpoch=8, flags=[SLAVE], aliases=[]]
node:RedisClusterNodeSnapshot [uri=RedisURI [host='NAME-0001-002.NAME.BLAH.cache.amazonaws.com', port=6379], nodeId='9b9e0fc92afdc39c3b20fc1aceb2825529093f9a', connected=true, slaveOf='null', pingSentTimestamp=0, pongReceivedTimestamp=1571319106000, configEpoch=8, flags=[MYSELF, MASTER], aliases=[RedisURI [host='NAME-0001-002.NAME.BLAH.cache.amazonaws.com', port=6379]], slot count=5462]
node:RedisClusterNodeSnapshot [uri=RedisURI [host='NAME-0002-003.NAME.BLAH.cache.amazonaws.com', port=6379], nodeId='49ea7e2a9be162913394e800efc7a80cdfa5b60d', connected=true, slaveOf='5c0fb49f12ecdf05e020e6a6ff1e82d360a0f714', pingSentTimestamp=0, pongReceivedTimestamp=1571319105000, configEpoch=1, flags=[SLAVE], aliases=[]]
node:RedisClusterNodeSnapshot [uri=RedisURI [host='NAME-0002-001.NAME.BLAH.cache.amazonaws.com', port=6379], nodeId='5c0fb49f12ecdf05e020e6a6ff1e82d360a0f714', connected=true, slaveOf='null', pingSentTimestamp=0, pongReceivedTimestamp=1571319108012, configEpoch=1, flags=[MASTER], aliases=[], slot count=5461]
node:RedisClusterNodeSnapshot [uri=RedisURI [host='NAME-0003-001.NAME.BLAH.cache.amazonaws.com', port=6379], nodeId='a9d036fd6e1ff8c7a2d48dce7d262f6ac491a936', connected=true, slaveOf='26aec6670c562365c473e1c4d63fc6bdf9d4f879', pingSentTimestamp=0, pongReceivedTimestamp=1571319107000, configEpoch=5, flags=[SLAVE], aliases=[]]
node:RedisClusterNodeSnapshot [uri=RedisURI [host='NAME-0003-002.NAME.BLAH.cache.amazonaws.com', port=6379], nodeId='a93c6d9e31424f67b90e6e009bceaa3d0b2db65c', connected=true, slaveOf='26aec6670c562365c473e1c4d63fc6bdf9d4f879', pingSentTimestamp=0, pongReceivedTimestamp=1571319108000, configEpoch=5, flags=[SLAVE], aliases=[]]
node:RedisClusterNodeSnapshot [uri=RedisURI [host='NAME-0003-003.NAME.BLAH.cache.amazonaws.com', port=6379], nodeId='26aec6670c562365c473e1c4d63fc6bdf9d4f879', connected=true, slaveOf='null', pingSentTimestamp=0, pongReceivedTimestamp=1571319106016, configEpoch=5, flags=[MASTER], aliases=[], slot count=5461]
node:RedisClusterNodeSnapshot [uri=RedisURI [host='NAME-0002-002.NAME.BLAH.cache.amazonaws.com', port=6379], nodeId='9db2254a3644ce2c965fe46d21a9842615a9bb93', connected=true, slaveOf='5c0fb49f12ecdf05e020e6a6ff1e82d360a0f714', pingSentTimestamp=0, pongReceivedTimestamp=1571319107015, configEpoch=1, flags=[SLAVE], aliases=[]]
node:RedisClusterNodeSnapshot [uri=RedisURI [host='NAME-0001-001.NAME.BLAH.cache.amazonaws.com', port=6379], nodeId='1bde423279fbfc22cd2d92e2776581374302655f', connected=true, slaveOf='null', pingSentTimestamp=1571319015542, pongReceivedTimestamp=1571319010000, configEpoch=3, flags=[MASTER, FAIL], aliases=[]]
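For reference, a minimal sketch of how a partition snapshot like the one above can be dumped; the logPartitions helper and the use of System.out are assumptions for illustration, not the exact code that produced this output:

import io.lettuce.core.cluster.RedisClusterClient;
import io.lettuce.core.cluster.models.partitions.RedisClusterNode;

// Hypothetical helper: print the client's current view of the cluster topology.
// Each partition entry is a RedisClusterNode, similar to the snapshot above.
static void logPartitions(RedisClusterClient clusterClient) {
    for (RedisClusterNode node : clusterClient.getPartitions()) {
        System.out.println("node:" + node);
    }
}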

The client then reported the following exception every few seconds or so:

ClusterTopologyRefresh:228 - Unable to connect to NAME-0001-001.NAME.BLAH.cache.amazonaws.com:6379
java.util.concurrent.CompletionException: io.netty.channel.ConnectTimeoutException: connection timed out: NAME-0001-001.NAME.BLAH.cache.amazonaws.com/xx.xx.xx.xx:6379
        at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
        at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
        at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
        at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
        at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
        at io.lettuce.core.AbstractRedisClient.lambda$initializeChannelAsync0$4(AbstractRedisClient.java:329)
        at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:502)
        at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:495)
        at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:474)
        at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:415)
        at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:540)
        at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:533)
        at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:114)
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:269)
        at io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)
        at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:127)
        at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:405)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:906)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(Thread.java:748)
Caused by: io.netty.channel.ConnectTimeoutException: connection timed out: NAME-0001-001.NAME.BLAH.cache.amazonaws.com/xx.xx.xx.xx:6379
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:267)
        ... 9 more

Then the old master node started accepting connections, but it was in a state where it couldn’t yet handle the AUTH my client sends. Somehow that was enough for Lettuce to stop failing writes, i.e. I started getting exceptions like these instead:

ClusterTopologyRefresh:228 - Unable to connect to NAME-0001-001.NAME.BLAH.cache.amazonaws.com:6379
java.util.concurrent.CompletionException: io.lettuce.core.RedisCommandExecutionException: ERR Client sent AUTH, but no password is set
        at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:326)
        at java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:338)
        at java.util.concurrent.CompletableFuture.uniRelay(CompletableFuture.java:911)
        at java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:899)
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
        at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
        at io.lettuce.core.protocol.AsyncCommand.doCompleteExceptionally(AsyncCommand.java:139)
        at io.lettuce.core.protocol.AsyncCommand.completeResult(AsyncCommand.java:120)
        at io.lettuce.core.protocol.AsyncCommand.complete(AsyncCommand.java:111)
        at io.lettuce.core.protocol.CommandHandler.complete(CommandHandler.java:646)
        at io.lettuce.core.protocol.CommandHandler.decode(CommandHandler.java:604)
        at io.lettuce.core.protocol.CommandHandler.channelRead(CommandHandler.java:556)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
        at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1478)
        at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1227)
        at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1274)
        at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:502)
        at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:441)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:278)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1408)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:930)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:682)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:617)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:534)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:906)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(Thread.java:748)
Caused by: io.lettuce.core.RedisCommandExecutionException: ERR Client sent AUTH, but no password is set
        at io.lettuce.core.ExceptionFactory.createExecutionException(ExceptionFactory.java:135)
        at io.lettuce.core.ExceptionFactory.createExecutionException(ExceptionFactory.java:108)
        ... 30 more

Later I also saw another burst of these messages:

ClusterTopologyRefresh:163 - Cannot retrieve partition view from RedisURI [host='NAME-0001-001.NAME.BLAH.cache.amazonaws.com', port=6379], error: java.util.concurrent.ExecutionException: io.lettuce.core.RedisLoadingException: LOADING Redis is loading the dataset in memory

which correlates with some other issues we saw, though I haven’t found a hard link.

Input Code

My clusterClient is initialized as follows:

final ClusterTopologyRefreshOptions topologyRefreshOptions = ClusterTopologyRefreshOptions.builder()
        .enableAdaptiveRefreshTrigger(
            ClusterTopologyRefreshOptions.RefreshTrigger.MOVED_REDIRECT,
            ClusterTopologyRefreshOptions.RefreshTrigger.PERSISTENT_RECONNECTS)
        .adaptiveRefreshTriggersTimeout(java.time.Duration.ofSeconds(30))
        .enablePeriodicRefresh(java.time.Duration.ofSeconds(30))
        .build();

clusterClient.setOptions(ClusterClientOptions.builder()
        .topologyRefreshOptions(topologyRefreshOptions)
        .build());
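For context, a rough sketch of how the surrounding client setup might look; the configuration endpoint hostname and the SSL setting below are placeholders for illustration, not the actual values from my environment:

import io.lettuce.core.RedisURI;
import io.lettuce.core.cluster.RedisClusterClient;
import io.lettuce.core.cluster.api.StatefulRedisClusterConnection;

// Seed the client with the ElastiCache configuration endpoint (placeholder host).
RedisURI seed = RedisURI.builder()
        .withHost("NAME.BLAH.clustercfg.cache.amazonaws.com")
        .withPort(6379)
        .withSsl(true)
        .build();

RedisClusterClient clusterClient = RedisClusterClient.create(seed);
// ...the refresh options from the snippet above are then applied via clusterClient.setOptions(...)

// Connections share the client's partition view and are routed per slot.
StatefulRedisClusterConnection<String, String> connection = clusterClient.connect();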

Expected behavior/code

Ideally, once the slave became master for the shard and had taken over the slots from the old master, Lettuce could detect that every slot has a master again and resume writes. That is, as soon as the old slave reached master status in the cluster, things should return to normal.
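As a stopgap, one could check the partitions and force a refresh whenever a master is flagged FAIL; a minimal sketch of that idea (the helper name refreshIfMasterFailed is made up for illustration, and this is a workaround, not a fix):

import io.lettuce.core.cluster.RedisClusterClient;
import io.lettuce.core.cluster.models.partitions.RedisClusterNode;
import io.lettuce.core.cluster.models.partitions.RedisClusterNode.NodeFlag;

// Hypothetical workaround: if any master is flagged FAIL, re-read the topology so
// the promoted slave (now owning the slots) is picked up sooner.
static void refreshIfMasterFailed(RedisClusterClient clusterClient) {
    for (RedisClusterNode node : clusterClient.getPartitions()) {
        if (node.getFlags().contains(NodeFlag.MASTER)
                && node.getFlags().contains(NodeFlag.FAIL)) {
            clusterClient.refreshPartitions();
            return;
        }
    }
}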

Environment

  • Lettuce version(s): 5.1.4.RELEASE
  • Redis version: 5.0.4

Possible Solution

Additional context

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

2 reactions
mp911de commented, Oct 25, 2019

This is the expected behavior. Lettuce attempts to obtain the topology from each known node as long as the node is part of the cluster. The moment you remove the node from the cluster, Lettuce stops attempting to contact the node. We could indeed improve logging and remove the stack trace during the topology refresh.

Cluster nodes are good for serving data and Pub/Sub. We cannot drop a node from the cluster just because no slots are assigned to it.
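For anyone mainly bothered by the repeated connection attempts in the refresh logs, a hedged sketch of one related knob: restricting topology refresh queries to the seed nodes with dynamicRefreshSources(false). This is a trade-off (nodes discovered at runtime are no longer queried during refresh), not something prescribed in this thread:

import java.time.Duration;
import io.lettuce.core.cluster.ClusterClientOptions;
import io.lettuce.core.cluster.ClusterTopologyRefreshOptions;

// dynamicRefreshSources(false) makes Lettuce query only the initial seed URIs during
// topology refresh, so nodes discovered later (including failed ones) are not contacted.
ClusterTopologyRefreshOptions refreshOptions = ClusterTopologyRefreshOptions.builder()
        .enablePeriodicRefresh(Duration.ofSeconds(30))
        .dynamicRefreshSources(false)
        .build();

ClusterClientOptions options = ClusterClientOptions.builder()
        .topologyRefreshOptions(refreshOptions)
        .build();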

0 reactions
peihao91 commented, Dec 4, 2019

Regarding redisTemplate.opsForList().rightPop(): did you solve the problem where threads are blocked and subsequent operations time out?

