Getting data from Redis becomes slow when the master goes down
Hi, I deployed a Redis cluster (1 master, 1 slave). When I stop the master Redis and the old slave is promoted to master (while the old master is still down), getting data from Redis afterwards becomes slow (it takes about 1-2 s at most to get a result).
The current redis cluster’s status is:
172.28.10.30:6379> CLUSTER NODES
a274999de9ccb909f9cfef07f413719df27218be 172.28.10.30:6379@16379 myself,master - 0 1568894258250 24 connected 0-16383
20ea1e3e713c7058658d91e0eee33c36eae0d030 172.28.10.29:6379@16379 master,fail? - 1568894306655 1568894288307 23 connected
172.28.10.30:6379> CLUSTER info
cluster_state:ok
cluster_slots_assigned:16384
cluster_slots_ok:16384
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:2
cluster_size:1
cluster_current_epoch:24
cluster_my_epoch:24
cluster_stats_messages_ping_sent:19545
cluster_stats_messages_pong_sent:22
cluster_stats_messages_meet_sent:1
cluster_stats_messages_sent:19568
cluster_stats_messages_ping_received:21
cluster_stats_messages_pong_received:20
cluster_stats_messages_received:41
I think this is similar to issue https://github.com/Grokzen/redis-py-cluster/issues/274; in that issue the cluster has 3 masters and 3 slaves, and the failure leads to a TTL exception.
I checked the source code; it follows the steps below:
- The execute_command method gets the OLD master node from the local node cache.
- Because the OLD master node is down, a ConnectionError is raised, and on the next attempt a random node is tried.
- In this two-node case, each retry has only a 50% chance of hitting the right node.
So the client falls into a bad state and cannot recover unless I re-initialize the node cache manually (the redis.connection_pool.nodes.initialize() method). For subsequent Redis commands, although the cluster itself is healthy now (the OLD slave has become the master), either we get the right data but it takes too much time, or in the worst case we get no data at all because of a TTL exception.
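Below is a minimal sketch of the manual workaround described above, assuming the redis-py-cluster RedisCluster client. The host addresses and the get_with_refresh helper are illustrative; connection_pool.nodes.initialize() is the method mentioned above.

```python
from rediscluster import RedisCluster
from redis.exceptions import ConnectionError, TimeoutError

# Illustrative startup nodes matching the two-node cluster above.
startup_nodes = [
    {"host": "172.28.10.29", "port": "6379"},
    {"host": "172.28.10.30", "port": "6379"},
]
rc = RedisCluster(startup_nodes=startup_nodes, decode_responses=True)

def get_with_refresh(key):
    """Fetch a key, forcing a node-table refresh if the cached master is unreachable."""
    try:
        return rc.get(key)
    except (ConnectionError, TimeoutError):
        # Rebuild the slot -> node mapping so the next attempt targets
        # the newly promoted master instead of the dead old master.
        rc.connection_pool.nodes.initialize()
        return rc.get(key)
```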
Is this a known problem, and how can it be fixed?
Thanks.
Issue Analytics
- State:
- Created 4 years ago
- Comments: 8 (5 by maintainers)
In my view, it is somewhat expected that during an actual failover it can take a few seconds, if not more, before clients find their way back to the right master and the new cluster setup, so there is not much that can be done to the detection and retry algorithm that tries to find the new cluster after one node fails out. I do agree that step 3 could be optimized a bit so that, when a master goes down, the client does not keep retrying the old master as well.

But the idea behind the cluster detection algorithm is that you are supposed to provide a stable set of startup nodes, at least one of which you expect to still be inside the cluster and to carry the correct cluster state. There is some case for also trying the nodes found during discovery, for example when you have a long-lived cluster where you migrate nodes on a semi-regular basis and the node setup does not look the same six months down the line as when you first started it up. Still, I am not sure I want to change this implementation, since both the reference implementation and most other clients use the same method of falling back to the startup_nodes when a node fails.
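To illustrate the fallback-to-startup-nodes approach described above, here is a hedged sketch rather than the library's actual code: the node addresses and the rediscover_slots function are made up for illustration, using only the plain redis-py client.

```python
import redis

# A stable set of startup nodes the operator expects to stay part of the
# cluster (addresses are illustrative).
STARTUP_NODES = [("172.28.10.29", 6379), ("172.28.10.30", 6379)]

def rediscover_slots():
    """Walk the startup nodes until one answers CLUSTER SLOTS, then return
    its reply so the slot -> node table can be rebuilt from a live node."""
    for host, port in STARTUP_NODES:
        try:
            node = redis.Redis(host=host, port=port, socket_timeout=1)
            return node.execute_command("CLUSTER SLOTS")
        except (redis.exceptions.ConnectionError, redis.exceptions.TimeoutError):
            continue  # this startup node is down, try the next one
    raise redis.exceptions.ConnectionError(
        "No startup node reachable; cannot rebuild cluster state"
    )
```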
Closing this issue due to inactivity now. If this issue still persists in the RC release tag 2.0.99, @summer-zt, then please open up a new issue with new tests from your side showing that the error still occurs after the fixes added to the next major release.