Improve log message for nodes that cannot be reached during reconnect/topology refresh
Bug Report
Current Behavior
I’m using AWS ElastiCache with three shards, each with one master and two replicas. I failed over the master, and a replica properly promoted itself to master for that shard, but Lettuce continued to fail writes until the old master became connectable again (see below; the old master wasn’t even usable at first, as AUTH commands failed). I have a periodic topology refresh configured, which I expected to remedy this, but it clearly didn’t: I kept getting failures for another three minutes (again, until the old master host came back up).
Some time later, part of my code (by design) destroyed the old pool, created a new one, and reported the state as follows (the notable thing is `flags=[MASTER, FAIL]` on the last line):
node:RedisClusterNodeSnapshot [uri=RedisURI [host='NAME-0001-003.NAME.BLAH.cache.amazonaws.com', port=6379], nodeId='f5bf8d2f70ad6d478cd61b1050059a19ee027406', connected=true, slaveOf='9b9e0fc92afdc39c3b20fc1aceb2825529093f9a', pingSentTimestamp=0, pongReceivedTimestamp=1571319108972, configEpoch=8, flags=[SLAVE], aliases=[]]
node:RedisClusterNodeSnapshot [uri=RedisURI [host='NAME-0001-002.NAME.BLAH.cache.amazonaws.com', port=6379], nodeId='9b9e0fc92afdc39c3b20fc1aceb2825529093f9a', connected=true, slaveOf='null', pingSentTimestamp=0, pongReceivedTimestamp=1571319106000, configEpoch=8, flags=[MYSELF, MASTER], aliases=[RedisURI [host='NAME-0001-002.NAME.BLAH.cache.amazonaws.com', port=6379]], slot count=5462]
node:RedisClusterNodeSnapshot [uri=RedisURI [host='NAME-0002-003.NAME.BLAH.cache.amazonaws.com', port=6379], nodeId='49ea7e2a9be162913394e800efc7a80cdfa5b60d', connected=true, slaveOf='5c0fb49f12ecdf05e020e6a6ff1e82d360a0f714', pingSentTimestamp=0, pongReceivedTimestamp=1571319105000, configEpoch=1, flags=[SLAVE], aliases=[]]
node:RedisClusterNodeSnapshot [uri=RedisURI [host='NAME-0002-001.NAME.BLAH.cache.amazonaws.com', port=6379], nodeId='5c0fb49f12ecdf05e020e6a6ff1e82d360a0f714', connected=true, slaveOf='null', pingSentTimestamp=0, pongReceivedTimestamp=1571319108012, configEpoch=1, flags=[MASTER], aliases=[], slot count=5461]
node:RedisClusterNodeSnapshot [uri=RedisURI [host='NAME-0003-001.NAME.BLAH.cache.amazonaws.com', port=6379], nodeId='a9d036fd6e1ff8c7a2d48dce7d262f6ac491a936', connected=true, slaveOf='26aec6670c562365c473e1c4d63fc6bdf9d4f879', pingSentTimestamp=0, pongReceivedTimestamp=1571319107000, configEpoch=5, flags=[SLAVE], aliases=[]]
node:RedisClusterNodeSnapshot [uri=RedisURI [host='NAME-0003-002.NAME.BLAH.cache.amazonaws.com', port=6379], nodeId='a93c6d9e31424f67b90e6e009bceaa3d0b2db65c', connected=true, slaveOf='26aec6670c562365c473e1c4d63fc6bdf9d4f879', pingSentTimestamp=0, pongReceivedTimestamp=1571319108000, configEpoch=5, flags=[SLAVE], aliases=[]]
node:RedisClusterNodeSnapshot [uri=RedisURI [host='NAME-0003-003.NAME.BLAH.cache.amazonaws.com', port=6379], nodeId='26aec6670c562365c473e1c4d63fc6bdf9d4f879', connected=true, slaveOf='null', pingSentTimestamp=0, pongReceivedTimestamp=1571319106016, configEpoch=5, flags=[MASTER], aliases=[], slot count=5461]
node:RedisClusterNodeSnapshot [uri=RedisURI [host='NAME-0002-002.NAME.BLAH.cache.amazonaws.com', port=6379], nodeId='9db2254a3644ce2c965fe46d21a9842615a9bb93', connected=true, slaveOf='5c0fb49f12ecdf05e020e6a6ff1e82d360a0f714', pingSentTimestamp=0, pongReceivedTimestamp=1571319107015, configEpoch=1, flags=[SLAVE], aliases=[]]
node:RedisClusterNodeSnapshot [uri=RedisURI [host='NAME-0001-001.NAME.BLAH.cache.amazonaws.com', port=6379], nodeId='1bde423279fbfc22cd2d92e2776581374302655f', connected=true, slaveOf='null', pingSentTimestamp=1571319015542, pongReceivedTimestamp=1571319010000, configEpoch=3, flags=[MASTER, FAIL], aliases=[]]
and then reports the following exception every few seconds or so:
ClusterTopologyRefresh:228 - Unable to connect to NAME-0001-001.NAME.BLAH.cache.amazonaws.com:6379
java.util.concurrent.CompletionException: io.netty.channel.ConnectTimeoutException: connection timed out: NAME-0001-001.NAME.BLAH.cache.amazonaws.com/xx.xx.xx.xx:6379
at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
at io.lettuce.core.AbstractRedisClient.lambda$initializeChannelAsync0$4(AbstractRedisClient.java:329)
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:502)
at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:495)
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:474)
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:415)
at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:540)
at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:533)
at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:114)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:269)
at io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)
at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:127)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:405)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:906)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
Caused by: io.netty.channel.ConnectTimeoutException: connection timed out: NAME-0001-001.NAME.BLAH.cache.amazonaws.com/xx.xx.xx.xx:6379
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:267)
... 9 more
Then the old master node starts accepting connections, but it is in a state where it cannot yet handle the AUTH commands I’m sending. Somehow this is enough for Lettuce to stop failing; that is, I start getting exceptions like these:
ClusterTopologyRefresh:228 - Unable to connect to NAME-0001-001.NAME.BLAH.cache.amazonaws.com:6379
java.util.concurrent.CompletionException: io.lettuce.core.RedisCommandExecutionException: ERR Client sent AUTH, but no password is set
at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:326)
at java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:338)
at java.util.concurrent.CompletableFuture.uniRelay(CompletableFuture.java:911)
at java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:899)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
at io.lettuce.core.protocol.AsyncCommand.doCompleteExceptionally(AsyncCommand.java:139)
at io.lettuce.core.protocol.AsyncCommand.completeResult(AsyncCommand.java:120)
at io.lettuce.core.protocol.AsyncCommand.complete(AsyncCommand.java:111)
at io.lettuce.core.protocol.CommandHandler.complete(CommandHandler.java:646)
at io.lettuce.core.protocol.CommandHandler.decode(CommandHandler.java:604)
at io.lettuce.core.protocol.CommandHandler.channelRead(CommandHandler.java:556)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1478)
at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1227)
at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1274)
at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:502)
at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:441)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:278)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1408)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:930)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:682)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:617)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:534)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:906)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
Caused by: io.lettuce.core.RedisCommandExecutionException: ERR Client sent AUTH, but no password is set
at io.lettuce.core.ExceptionFactory.createExecutionException(ExceptionFactory.java:135)
at io.lettuce.core.ExceptionFactory.createExecutionException(ExceptionFactory.java:108)
... 30 more
Later I also had another burst of these messages:
ClusterTopologyRefresh:163 - Cannot retrieve partition view from RedisURI [host='NAME-0001-001.NAME.BLAH.cache.amazonaws.com', port=6379], error: java.util.concurrent.ExecutionException: io.lettuce.core.RedisLoadingException: LOADING Redis is loading the dataset in memory
which correlate to some other issues we saw, but I’ve not found a hard link.
Input Code
My clusterClient is initialized as follows:
```java
final ClusterTopologyRefreshOptions topologyRefreshOptions = ClusterTopologyRefreshOptions.builder()
        .enableAdaptiveRefreshTrigger(
                ClusterTopologyRefreshOptions.RefreshTrigger.MOVED_REDIRECT,
                ClusterTopologyRefreshOptions.RefreshTrigger.PERSISTENT_RECONNECTS)
        .adaptiveRefreshTriggersTimeout(java.time.Duration.ofSeconds(30))
        .enablePeriodicRefresh(java.time.Duration.ofSeconds(30))
        .build();

clusterClient.setOptions(ClusterClientOptions.builder()
        .topologyRefreshOptions(topologyRefreshOptions)
        .build());
```
Expected behavior/code
Ideally, once the replica became master for the shard and had taken over the slots from the old master, Lettuce could detect that every slot has a master and resume normal operation. That is, as soon as the old replica reached master status in the cluster, things should return to normal.
Environment
- Lettuce version(s): 5.1.4.RELEASE
- Redis version: 5.0.4
Possible Solution
Additional context
Issue Analytics
- State:
- Created 4 years ago
- Comments: 5 (2 by maintainers)
Top GitHub Comments
This is the expected behavior. Lettuce attempts to obtain the topology from each known node as long as the node is part of the cluster. The moment you remove the node from the cluster, Lettuce stops attempting to contact the node. We could indeed improve logging and remove the stack trace during the topology refresh.
Cluster nodes are good for serving data and Pub/Sub. We cannot drop a node from the cluster just because no slots are assigned to it.
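As a workaround until the logging is improved, the repeated stack traces can be demoted via the logging framework. A sketch for Logback follows; the logger name is assumed from the `ClusterTopologyRefresh` class shown in the log lines above, and the chosen level is a matter of taste.

```xml
<!-- logback.xml: raise the threshold for the topology-refresh logger so that
     per-node connect failures no longer emit stack traces at WARN level. -->
<configuration>
  <logger name="io.lettuce.core.cluster.topology.ClusterTopologyRefresh" level="ERROR"/>
</configuration>
```

Note this only hides the noise; the refresh behavior itself is unchanged.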
Regarding `redisTemplate.opsForList().rightPop()`: did you solve the problem where threads are blocked and subsequent operations time out?