
Document MasterSlave connection behavior on partial node failures

See original GitHub issue

Bug Report

Current Behavior

Connecting to master/slave setups using autodiscovery can half-fail and leave the list of nodes empty, with no chance to recover. The problem is that creating the StatefulRedisMasterSlaveConnection will succeed, but every attempt to use it will fail. The error produced is at io.lettuce.core.masterslave.MasterSlaveConnectionProvider#getMaster.

I believe this happens because the initial topology refresh can fail. The first connection to the master may succeed, so no exception is raised. The refresh, however, seems to happen out-of-band and appears to ping each discovered Redis node at io.lettuce.core.masterslave.MasterSlaveTopologyRefresh#getNodes and, more specifically, io.lettuce.core.masterslave.Connections#requestPing.

To reproduce this, set a breakpoint (or a pause of your choice) at the call to requestPing and send a DEBUG SLEEP [some interval] to Redis - preferably the master node. After this you can watch the knownNodes list refresh; in a single-master environment it will set itself to an empty list. In a master/slave environment, any node that fails to ping at this exact point in time will be removed from the node list and never contacted again. I think this is probably by design, but it produces at best unexpected results and at worst connections that can never work.
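The reproduction steps above can be sketched with redis-cli. This is a hedged example: the host and port are assumptions, and it requires a live Redis instance plus a debugger pausing the client at requestPing.

```shell
# Assumes a master at 127.0.0.1:6379; adjust host/port to your setup.
# While the client is paused at Connections#requestPing (breakpoint),
# make the master unresponsive for 100 seconds:
redis-cli -h 127.0.0.1 -p 6379 DEBUG SLEEP 100
# Resume the client: the ping times out, the node is dropped from
# knownNodes, and subsequent commands fail with "Master is currently unknown".
```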

From here I don’t think it can recover. The instance of the StatefulRedisMasterSlaveConnection stays alive, but it will throw every time you try to use it because there are no nodes in the list and the topology will never refresh. This makes it hard to detect.

Input Code

  • Java; this is just a basic connection - you must interrupt it as described above and introduce some manual chaos.
import java.util.Properties;

import io.lettuce.core.ReadFrom;
import io.lettuce.core.RedisClient;
import io.lettuce.core.RedisURI;
import io.lettuce.core.codec.Utf8StringCodec;
import io.lettuce.core.masterslave.MasterSlave;
import io.lettuce.core.masterslave.StatefulRedisMasterSlaveConnection;
import io.lettuce.core.resource.ClientResources;
import io.lettuce.core.resource.DefaultClientResources;
import io.lettuce.core.resource.DirContextDnsResolver;

// buildRedisUri is the reporter's own helper (not shown)
RedisURI redisUri = buildRedisUri(host, port, useSsl, password);
ClientResources redisClientResources = DefaultClientResources.builder()
        .dnsResolver(new DirContextDnsResolver(true, false, new Properties()))
        .build();

RedisClient redisClient = RedisClient.create(redisClientResources, redisUri);

StatefulRedisMasterSlaveConnection<String, String> conn = MasterSlave.connect(redisClient, new Utf8StringCodec(), redisUri);
conn.setReadFrom(ReadFrom.MASTER_PREFERRED);

// !!! Throws at `io.lettuce.core.masterslave.MasterSlaveConnectionProvider.getMaster`
//               `throw new RedisException(String.format("Master is currently unknown: %s", knownNodes));`
String result = conn.sync().get("MyKey");

Expected behavior/code

I expected that if the nodes were unreachable, the connection would either:

  1. Fail to connect and throw an exception (which we can catch on our end and retry), or
  2. Attempt to heal itself later via a mechanism like periodic refresh (as clustered mode does)
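Expectation 1 above assumes the failure surfaces as an exception at connect time, which the caller can then wrap in a retry loop. A minimal, generic sketch of such a wrapper (Retry and withRetries are hypothetical names, not Lettuce API):

```java
import java.util.function.Supplier;

/** Hypothetical retry wrapper: if connection setup threw instead of
 *  half-failing silently, the caller could catch and retry like this. */
public class Retry {
    public static <T> T withRetries(Supplier<T> action, int maxAttempts, long backoffMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                last = e; // remember the failure and back off before the next attempt
                try {
                    Thread.sleep(backoffMillis * attempt); // linear backoff
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw e;
                }
            }
        }
        throw last; // all attempts exhausted
    }
}
```

In the reported scenario this pattern cannot help, because MasterSlave.connect returns successfully and only later calls throw; the point is that a connect-time exception would make this standard recovery strategy applicable.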

Environment

  • Lettuce version(s): 5.1.1.RELEASE
  • Redis version: 4.0.10

Possible Solution

I would suggest that the list of nodes initially discovered be cached somewhere and that retries be attempted later. Clustered clients already do this with periodic and adaptive refreshes. Reusing the initial discovery this way in non-clustered setups should produce more reliable long-term results and a chance to recover from this particular failure state.

Another solution would be to simply throw an error if no nodes are discovered, so we can catch it before the StatefulRedisMasterSlaveConnection instance is retained (just as a normal connection failure would). It may also be worthwhile to expose the current list of nodes from the connection so it can be inspected for issues (e.g. discovering only slave nodes may not be suitable for all cases).
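The caching idea above can be sketched in plain Java. This is an illustrative model only, not Lettuce internals: HealingNodeList and its members are hypothetical names standing in for MasterSlaveConnectionProvider's knownNodes and MasterSlaveTopologyRefresh#getNodes.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Supplier;

/** Hypothetical sketch of the proposed fix: remember the initially
 *  discovered nodes and fall back to them when a refresh comes back
 *  empty, so a transient ping failure cannot permanently empty the
 *  node list and strand the connection. */
public class HealingNodeList {
    private final List<String> initialNodes;        // seed list from first discovery
    private final List<String> knownNodes;          // current view of the topology
    private final Supplier<List<String>> discovery; // stands in for topology refresh

    public HealingNodeList(List<String> initialNodes, Supplier<List<String>> discovery) {
        this.initialNodes = List.copyOf(initialNodes);
        this.knownNodes = new CopyOnWriteArrayList<>(initialNodes);
        this.discovery = discovery;
    }

    /** A periodic or adaptive refresh would call this; an empty result
     *  restores the seed list instead of leaving zero usable nodes. */
    public void refresh() {
        List<String> discovered = discovery.get();
        knownNodes.clear();
        knownNodes.addAll(discovered.isEmpty() ? initialNodes : discovered);
    }

    public List<String> knownNodes() {
        return List.copyOf(knownNodes);
    }
}
```

A real implementation would also need to re-verify node roles after restoring the seed list, since the former master may have changed in the meantime.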

Additional context

I hope this helps! Please let me know if I have left out any important details or if there is a workaround or expected behavior I’m clearly missing. I may be able to take suggestions and make them into a PR.

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 8 (5 by maintainers)

Top GitHub Comments

1 reaction
mp911de commented, Nov 7, 2018

Yes, I tried it with various scenarios (timeout 1 second):

  • DEBUG SLEEP 100 before connect
  • Breakpoint in requestPing, then DEBUG SLEEP 100, then continue (both when connecting to the master and to the slave node)
1 reaction
mp911de commented, Nov 7, 2018

The above-mentioned workflow (testing the connection) seems about right.

I tried to reproduce a half-failed state. When the specified endpoint does not react (DEBUG SLEEP), then the connection creation fails. When the master node does not respond in requestPing(…), then requests to the master node fail with io.lettuce.core.RedisException: Master is currently unknown:.

Both work as they should. MasterSlave Javadoc now explains the actual behavior.
