Document MasterSlave connection behavior on partial node failures
Bug Report
Current Behavior
Connecting to master/slave setups using autodiscovery can half-fail and leave the list of nodes empty, with no chance to recover. The problem is that creating the StatefulRedisMasterSlaveConnection will succeed, but every attempt to use it will fail. The error is thrown at io.lettuce.core.masterslave.MasterSlaveConnectionProvider#getMaster.
I believe this happens because the initial topology refresh can fail. The first connection to the master may be fine, which is why we do not get an exception. But the refresh seems to happen out-of-band and appears to ping each discovered Redis node at io.lettuce.core.masterslave.MasterSlaveTopologyRefresh#getNodes and, more specifically, io.lettuce.core.masterslave.Connections#requestPing.
To reproduce this you can set a breakpoint (or a pause of your choice) at the call to requestPing and send a DEBUG SLEEP [some interval] to Redis, preferably to the master node. After this you can see the knownNodes list refresh, and if you’re in a single-master environment it will set itself to an empty list. If you’re in a master/slave environment, any node that fails to ping at this exact point in time will be removed from the node list and not contacted again. I think this is probably by design, but it unfortunately produces at best unexpected results and at worst connections that can’t ever work.
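The failure mode described above can be modelled in a few lines. This is a simplified model and not Lettuce's actual implementation: TopologyRefreshModel, Node, refresh and getMaster are hypothetical names, and the ping is reduced to a predicate. It only illustrates why one unresponsive master at refresh time leaves a connection that exists but cannot serve requests.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Simplified model of the behavior described in this report; NOT Lettuce's code.
// All names are hypothetical and chosen for illustration only.
final class TopologyRefreshModel {

    // A discovered node: an address plus its role.
    static final class Node {
        final String address;
        final boolean master;

        Node(String address, boolean master) {
            this.address = address;
            this.master = master;
        }

        @Override
        public String toString() {
            return (master ? "master@" : "slave@") + address;
        }
    }

    // Keep only the nodes that answer the ping; every other node is dropped.
    static List<Node> refresh(List<Node> discovered, Predicate<Node> answersPing) {
        List<Node> known = new ArrayList<>();
        for (Node node : discovered) {
            if (answersPing.test(node)) {
                known.add(node);
            }
        }
        return known;
    }

    // Models the observed error: without a master in the list, every lookup fails.
    static Node getMaster(List<Node> knownNodes) {
        for (Node node : knownNodes) {
            if (node.master) {
                return node;
            }
        }
        throw new IllegalStateException("Master is currently unknown: " + knownNodes);
    }
}
```

In this model, a refresh in which the master does not answer yields a list without a master, and getMaster then throws on every call until something triggers another refresh.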
From here I don’t think it can recover. The instance of the StatefulRedisMasterSlaveConnection stays alive, but it will throw every time you try to use it because there are no nodes in the list and the topology will never refresh. This makes it hard to detect.
Input Code
- Java; this is just a basic connection - you must interrupt it as described above and introduce some manual chaos.
RedisURI redisUri = buildRedisUri(host, port, useSsl, password);
ClientResources redisClientResources = DefaultClientResources.builder()
        .dnsResolver(new DirContextDnsResolver(true, false, new Properties()))
        .build();
RedisClient redisClient = RedisClient.create(redisClientResources, redisUri);
StatefulRedisMasterSlaveConnection<String, String> conn = MasterSlave.connect(redisClient, new Utf8StringCodec(), redisUri);
conn.setReadFrom(ReadFrom.MASTER_PREFERRED);
// !!! Throws at `io.lettuce.core.masterslave.MasterSlaveConnectionProvider.getMaster`
// `throw new RedisException(String.format("Master is currently unknown: %s", knownNodes));`
String result = conn.sync().get("MyKey");
Expected behavior/code
I expected that if the connections were unreachable they would either:
- Fail to connect and throw an exception (we can catch this on our end and retry) or…
- Attempt to heal itself in the future, via a mechanism like periodic refresh (such as clustered mode does)
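Until one of these behaviors is available, the first option can be approximated on the caller's side by probing the connection once right after creating it. The helper below is a minimal sketch under that assumption: VerifiedConnect and its parameters are hypothetical and not part of Lettuce, and it is written against plain functional interfaces so that it does not depend on any particular client API.

```java
import java.util.function.Consumer;
import java.util.function.Supplier;

// Hypothetical helper, not part of Lettuce: open a connection, probe it once,
// and start over with a fresh connection if the probe fails.
final class VerifiedConnect {

    static <C> C connect(Supplier<C> open, Consumer<C> probe, Consumer<C> close, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            C connection = null;
            try {
                connection = open.get();
                probe.accept(connection);     // issue one cheap command
                return connection;            // probe succeeded, connection is usable
            } catch (RuntimeException e) {
                last = e;
                if (connection != null) {
                    close.accept(connection); // discard the half-working connection
                }
            }
        }
        throw last; // every attempt failed; surface the most recent error
    }
}
```

With the code from this report, open would wrap the MasterSlave.connect(...) call, probe would run a command such as conn.sync().get("MyKey") (which throws a RedisException in the broken state), and close would call conn.close(). A delay between attempts is left out for brevity and is probably worth adding in practice.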
Environment
- Lettuce version(s): 5.1.1.RELEASE
- Redis version: 4.0.10
Possible Solution
I would suggest that the list of nodes initially discovered be cached somewhere, and retries be attempted later. Clustered clients do this with periodic and adaptive refreshes. I think this should be possible using the initial discovery in non-clustered setups to produce more reliable long-term results and a chance to recover from this particular failure state.
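As a rough illustration of this suggestion, the sketch below keeps the initially discovered nodes in a cache that never shrinks and re-pings all of them on every refresh, so a node that missed one ping can rejoin later. CachedTopology and its methods are hypothetical names, nodes are plain strings, and the ping is a predicate; none of this reflects how Lettuce is implemented.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Predicate;

// Hypothetical sketch, not Lettuce's API: remember every node from the initial
// discovery and re-check all of them on each refresh.
final class CachedTopology {

    private final List<String> discovered;        // cached initial discovery, never shrinks
    private final Predicate<String> answersPing;  // stand-in for a real PING
    private volatile List<String> knownNodes = Collections.emptyList();

    CachedTopology(List<String> discovered, Predicate<String> answersPing) {
        this.discovered = new ArrayList<>(discovered);
        this.answersPing = answersPing;
    }

    // Re-evaluate the full cached list, not only the currently known nodes.
    void refresh() {
        List<String> reachable = new ArrayList<>();
        for (String node : discovered) {
            if (answersPing.test(node)) {
                reachable.add(node);
            }
        }
        knownNodes = Collections.unmodifiableList(reachable);
    }

    List<String> knownNodes() {
        return knownNodes;
    }

    // Run refresh() periodically; the caller is responsible for shutting the executor down.
    ScheduledExecutorService startPeriodicRefresh(long period, TimeUnit unit) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            try {
                refresh();
            } catch (RuntimeException e) {
                // Keep the previous view; an uncaught exception would cancel future runs.
            }
        }, 0, period, unit);
        return scheduler;
    }
}
```

The important design choice is that refresh() iterates over the cached discovery result rather than over knownNodes, which is what allows recovery from an empty list.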
Another solution would be to simply throw an error if no nodes are discovered so we may catch it before the StatefulRedisMasterSlaveConnection instance is kept (just like a normal connection failure would produce). It may be worthwhile to expose the present list of nodes from the connection so they may be inspected for issues (e.g. if we only get slave nodes, that may not be suitable for all cases).
Additional context
I hope this helps! Please let me know if I have left out any important details or if there is a workaround or expected behavior I’m clearly missing. I may be able to take suggestions and make them into a PR.
Issue Analytics
- Created 5 years ago
- Comments:8 (5 by maintainers)
Top GitHub Comments
Yes, I tried it with various scenarios (timeout 1 second):
- DEBUG SLEEP 100 before connect
- requestPing, then DEBUG SLEEP 100, then continue (both, trying to connect the master and the slave node)
The above mentioned workflow (testing the connection) seems about right.
I tried to reproduce a half-failed state. When the specified endpoint does not react (DEBUG SLEEP), then the connection creation fails. When the master node does not respond in requestPing(…), then requests to the master node fail with "io.lettuce.core.RedisException: Master is currently unknown:". Both work as they should.
MasterSlave Javadoc now explains the actual behavior.