Improve log message for nodes that cannot be reached during reconnect/topology refresh
Bug Report
Current Behavior
I’m using AWS ElastiCache with three shards, each with one master and two replicas. I failed over the master, and a replica properly promoted itself to master for that shard, but Lettuce continued to fail writes until the old master became connectable again (see below; the old master wasn’t even usable at first, as AUTH commands failed). I have a periodic topology refresh configured, which I expected to remedy this, but it clearly didn’t: I kept getting failures for another three minutes (again, until the old master host came back up).
Some time later, part of my code (by design) destroyed the old pool, created a new one, and reported the state as follows (the notable thing is `flags=[MASTER, FAIL]` on the last line):
node:RedisClusterNodeSnapshot [uri=RedisURI [host='NAME-0001-003.NAME.BLAH.cache.amazonaws.com', port=6379], nodeId='f5bf8d2f70ad6d478cd61b1050059a19ee027406', connected=true, slaveOf='9b9e0fc92afdc39c3b20fc1aceb2825529093f9a', pingSentTimestamp=0, pongReceivedTimestamp=1571319108972, configEpoch=8, flags=[SLAVE], aliases=[]]
node:RedisClusterNodeSnapshot [uri=RedisURI [host='NAME-0001-002.NAME.BLAH.cache.amazonaws.com', port=6379], nodeId='9b9e0fc92afdc39c3b20fc1aceb2825529093f9a', connected=true, slaveOf='null', pingSentTimestamp=0, pongReceivedTimestamp=1571319106000, configEpoch=8, flags=[MYSELF, MASTER], aliases=[RedisURI [host='NAME-0001-002.NAME.BLAH.cache.amazonaws.com', port=6379]], slot count=5462]
node:RedisClusterNodeSnapshot [uri=RedisURI [host='NAME-0002-003.NAME.BLAH.cache.amazonaws.com', port=6379], nodeId='49ea7e2a9be162913394e800efc7a80cdfa5b60d', connected=true, slaveOf='5c0fb49f12ecdf05e020e6a6ff1e82d360a0f714', pingSentTimestamp=0, pongReceivedTimestamp=1571319105000, configEpoch=1, flags=[SLAVE], aliases=[]]
node:RedisClusterNodeSnapshot [uri=RedisURI [host='NAME-0002-001.NAME.BLAH.cache.amazonaws.com', port=6379], nodeId='5c0fb49f12ecdf05e020e6a6ff1e82d360a0f714', connected=true, slaveOf='null', pingSentTimestamp=0, pongReceivedTimestamp=1571319108012, configEpoch=1, flags=[MASTER], aliases=[], slot count=5461]
node:RedisClusterNodeSnapshot [uri=RedisURI [host='NAME-0003-001.NAME.BLAH.cache.amazonaws.com', port=6379], nodeId='a9d036fd6e1ff8c7a2d48dce7d262f6ac491a936', connected=true, slaveOf='26aec6670c562365c473e1c4d63fc6bdf9d4f879', pingSentTimestamp=0, pongReceivedTimestamp=1571319107000, configEpoch=5, flags=[SLAVE], aliases=[]]
node:RedisClusterNodeSnapshot [uri=RedisURI [host='NAME-0003-002.NAME.BLAH.cache.amazonaws.com', port=6379], nodeId='a93c6d9e31424f67b90e6e009bceaa3d0b2db65c', connected=true, slaveOf='26aec6670c562365c473e1c4d63fc6bdf9d4f879', pingSentTimestamp=0, pongReceivedTimestamp=1571319108000, configEpoch=5, flags=[SLAVE], aliases=[]]
node:RedisClusterNodeSnapshot [uri=RedisURI [host='NAME-0003-003.NAME.BLAH.cache.amazonaws.com', port=6379], nodeId='26aec6670c562365c473e1c4d63fc6bdf9d4f879', connected=true, slaveOf='null', pingSentTimestamp=0, pongReceivedTimestamp=1571319106016, configEpoch=5, flags=[MASTER], aliases=[], slot count=5461]
node:RedisClusterNodeSnapshot [uri=RedisURI [host='NAME-0002-002.NAME.BLAH.cache.amazonaws.com', port=6379], nodeId='9db2254a3644ce2c965fe46d21a9842615a9bb93', connected=true, slaveOf='5c0fb49f12ecdf05e020e6a6ff1e82d360a0f714', pingSentTimestamp=0, pongReceivedTimestamp=1571319107015, configEpoch=1, flags=[SLAVE], aliases=[]]
node:RedisClusterNodeSnapshot [uri=RedisURI [host='NAME-0001-001.NAME.BLAH.cache.amazonaws.com', port=6379], nodeId='1bde423279fbfc22cd2d92e2776581374302655f', connected=true, slaveOf='null', pingSentTimestamp=1571319015542, pongReceivedTimestamp=1571319010000, configEpoch=3, flags=[MASTER, FAIL], aliases=[]]
and then reports the following exception every few seconds or so:
ClusterTopologyRefresh:228 - Unable to connect to NAME-0001-001.NAME.BLAH.cache.amazonaws.com:6379
java.util.concurrent.CompletionException: io.netty.channel.ConnectTimeoutException: connection timed out: NAME-0001-001.NAME.BLAH.cache.amazonaws.com/xx.xx.xx.xx:6379
at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
at io.lettuce.core.AbstractRedisClient.lambda$initializeChannelAsync0$4(AbstractRedisClient.java:329)
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:502)
at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:495)
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:474)
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:415)
at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:540)
at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:533)
at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:114)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:269)
at io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)
at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:127)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:405)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:906)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
Caused by: io.netty.channel.ConnectTimeoutException: connection timed out: NAME-0001-001.NAME.BLAH.cache.amazonaws.com/xx.xx.xx.xx:6379
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:267)
... 9 more
Then the old master node starts accepting connections, but it is in a state where it cannot yet handle the AUTH commands I’m sending. Somehow this is enough for Lettuce to stop failing; that is, I start getting exceptions like these:
ClusterTopologyRefresh:228 - Unable to connect to NAME-0001-001.NAME.BLAH.cache.amazonaws.com:6379
java.util.concurrent.CompletionException: io.lettuce.core.RedisCommandExecutionException: ERR Client sent AUTH, but no password is set
at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:326)
at java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:338)
at java.util.concurrent.CompletableFuture.uniRelay(CompletableFuture.java:911)
at java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:899)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
at io.lettuce.core.protocol.AsyncCommand.doCompleteExceptionally(AsyncCommand.java:139)
at io.lettuce.core.protocol.AsyncCommand.completeResult(AsyncCommand.java:120)
at io.lettuce.core.protocol.AsyncCommand.complete(AsyncCommand.java:111)
at io.lettuce.core.protocol.CommandHandler.complete(CommandHandler.java:646)
at io.lettuce.core.protocol.CommandHandler.decode(CommandHandler.java:604)
at io.lettuce.core.protocol.CommandHandler.channelRead(CommandHandler.java:556)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1478)
at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1227)
at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1274)
at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:502)
at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:441)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:278)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1408)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:930)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:682)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:617)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:534)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:906)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
Caused by: io.lettuce.core.RedisCommandExecutionException: ERR Client sent AUTH, but no password is set
at io.lettuce.core.ExceptionFactory.createExecutionException(ExceptionFactory.java:135)
at io.lettuce.core.ExceptionFactory.createExecutionException(ExceptionFactory.java:108)
... 30 more
Later I also had another burst of these messages:
ClusterTopologyRefresh:163 - Cannot retrieve partition view from RedisURI [host='NAME-0001-001.NAME.BLAH.cache.amazonaws.com', port=6379], error: java.util.concurrent.ExecutionException: io.lettuce.core.RedisLoadingException: LOADING Redis is loading the dataset in memory
which correlate to some other issues we saw, but I’ve not found a hard link.
Input Code
My clusterClient is initialized as follows:
```java
final ClusterTopologyRefreshOptions topologyRefreshOptions = ClusterTopologyRefreshOptions.builder()
        .enableAdaptiveRefreshTrigger(
                ClusterTopologyRefreshOptions.RefreshTrigger.MOVED_REDIRECT,
                ClusterTopologyRefreshOptions.RefreshTrigger.PERSISTENT_RECONNECTS)
        .adaptiveRefreshTriggersTimeout(java.time.Duration.ofSeconds(30))
        .enablePeriodicRefresh(java.time.Duration.ofSeconds(30))
        .build();

clusterClient.setOptions(ClusterClientOptions.builder()
        .topologyRefreshOptions(topologyRefreshOptions)
        .build());
```
Expected behavior/code
Ideally, once the replica became master for the shard and had taken over the slots from the old master, Lettuce could detect that every slot has a master and resume normal operation. That is, as soon as the old replica reached master status in the cluster, things should return to normal.
Environment
- Lettuce version(s): 5.1.4.RELEASE
- Redis version: 5.0.4
Possible Solution
Additional context
Issue Analytics
- State:
- Created 4 years ago
- Comments: 5 (2 by maintainers)
Top GitHub Comments
This is the expected behavior. Lettuce attempts to obtain the topology from each known node as long as the node is part of the cluster. The moment you remove the node from the cluster, Lettuce stops attempting to contact the node. We could indeed improve logging and remove the stack trace during the topology refresh.
Cluster nodes are good for serving data and Pub/Sub. We cannot drop a node from the cluster just because no slots are assigned to it.
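As a workaround until the logging is improved, the repeated stack traces can be demoted via the logging framework. A sketch for Logback follows; the logger name is assumed from the `ClusterTopologyRefresh` class shown in the log lines above, and the chosen level is a matter of taste.

```xml
<!-- logback.xml: raise the threshold for the topology-refresh logger so that
     per-node connect failures no longer emit stack traces at WARN level. -->
<configuration>
  <logger name="io.lettuce.core.cluster.topology.ClusterTopologyRefresh" level="ERROR"/>
</configuration>
```

Note this only hides the noise; the refresh behavior itself is unchanged.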
Regarding `redisTemplate.opsForList().rightPop()`: did you solve the problem where threads are blocked and subsequent operations time out?