Getting data from Redis becomes slow when the master goes down

Hi, I deployed a Redis cluster (1 master, 1 slave). When I stop the master, the old slave is promoted to master (the old master stays down). After that, reading data from Redis is slow: a request takes about 1-2 s at most.

The current redis cluster’s status is:

172.28.10.30:6379> CLUSTER NODES
a274999de9ccb909f9cfef07f413719df27218be 172.28.10.30:6379@16379 myself,master - 0 1568894258250 24 connected 0-16383
20ea1e3e713c7058658d91e0eee33c36eae0d030 172.28.10.29:6379@16379 master,fail? - 1568894306655 1568894288307 23 connected
172.28.10.30:6379> CLUSTER info
cluster_state:ok
cluster_slots_assigned:16384
cluster_slots_ok:16384
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:2
cluster_size:1
cluster_current_epoch:24
cluster_my_epoch:24
cluster_stats_messages_ping_sent:19545
cluster_stats_messages_pong_sent:22
cluster_stats_messages_meet_sent:1
cluster_stats_messages_sent:19568
cluster_stats_messages_ping_received:21
cluster_stats_messages_pong_received:20
cluster_stats_messages_received:41
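
The CLUSTER NODES output above can also be checked programmatically. The following is only a sketch: `NodeInfo` and `parse_cluster_nodes` are hypothetical names, not part of any client library, and the field order assumed here is the documented one (id, address, flags, master id, ping-sent, pong-received, config epoch, link state, slots).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeInfo:
    node_id: str
    address: str        # host:port, without the @cluster-bus-port suffix
    flags: List[str]
    link_state: str
    slots: List[str]

    @property
    def suspected_down(self) -> bool:
        # "fail?" means PFAIL (suspected by this node), "fail" means confirmed FAIL.
        return "fail?" in self.flags or "fail" in self.flags


def parse_cluster_nodes(raw: str) -> List[NodeInfo]:
    """Parse the text returned by CLUSTER NODES into NodeInfo records."""
    nodes = []
    for line in raw.strip().splitlines():
        parts = line.split()
        if len(parts) < 8:
            continue  # skip blank or malformed lines
        nodes.append(NodeInfo(
            node_id=parts[0],
            address=parts[1].split("@")[0],
            flags=parts[2].split(","),
            link_state=parts[7],
            slots=parts[8:],  # empty when the node owns no slots
        ))
    return nodes


# The two lines reported in this issue.
SAMPLE = """\
a274999de9ccb909f9cfef07f413719df27218be 172.28.10.30:6379@16379 myself,master - 0 1568894258250 24 connected 0-16383
20ea1e3e713c7058658d91e0eee33c36eae0d030 172.28.10.29:6379@16379 master,fail? - 1568894306655 1568894288307 23 connected
"""

if __name__ == "__main__":
    for node in parse_cluster_nodes(SAMPLE):
        print(node.address, node.flags, node.slots, node.suspected_down)
```

On the sample, 172.28.10.29 is flagged `fail?` and owns no slots, while 172.28.10.30 owns all of 0-16383, which matches the `cluster_state:ok` reported by CLUSTER INFO.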

I think this is a similar problem to https://github.com/Grokzen/redis-py-cluster/issues/274, where the cluster has 3 masters and 3 slaves and the failure leads to a TTL exception.

I checked the source code, and it follows these steps:

  1. The execute_command method gets the OLD master node from the local cache.
  2. Because the OLD master node is down, a ConnectionError is raised, and the next attempt retries against a random node.
  3. In this case, each retry has a 50% chance of picking the right node.
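
The cost of the steps above can be estimated with a small simulation. This is a model of the behaviour as described in this report, not the library's actual code; `attempts_until_success` and `average_attempts` are hypothetical helpers, and the node addresses are the two from this issue.

```python
import random
from statistics import mean


def attempts_until_success(nodes, alive, cached, rng):
    """Count connection attempts until a live node is reached.

    Models the steps above: try the cached node first, then on every
    failure pick a node uniformly at random (the dead one included).
    """
    attempts = 1
    target = cached
    while target not in alive:
        target = rng.choice(nodes)
        attempts += 1
    return attempts


def average_attempts(trials=10_000, seed=0):
    """Average attempts per command for the two-node setup in this issue."""
    rng = random.Random(seed)
    nodes = ["172.28.10.29:6379", "172.28.10.30:6379"]
    alive = {"172.28.10.30:6379"}   # the promoted slave
    cached = "172.28.10.29:6379"    # the stale, dead master
    return mean(
        attempts_until_success(nodes, alive, cached, rng)
        for _ in range(trials)
    )


if __name__ == "__main__":
    print(f"average attempts per command: {average_attempts():.2f}")
```

Under this model the expected number of attempts is 1 + 1/0.5 = 3 per command: one failed attempt against the cached node, then two random picks on average. Every failed attempt can cost up to the connection timeout, which would be consistent with the 1-2 s latency reported.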

So the client falls into a bad state and cannot recover unless I re-initialize the node table manually (the redis.connection_pool.nodes.initialize() method). For subsequent Redis commands, even though the cluster is healthy again (the OLD slave is now the master) and we can get the right data, each call takes too much time; in the bad case we get no data at all because of the TTL exception.
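
The manual workaround can be wrapped in a helper. The sketch below is an assumption-laden illustration: `call_with_refresh` is a hypothetical function, the only library call it relies on is the `connection_pool.nodes.initialize()` method named above, and the exception types to catch are left to the caller because they depend on the client version.

```python
def call_with_refresh(client, command, *args, errors=(Exception,), max_refreshes=1):
    """Run a client command, refreshing the cluster layout if it fails.

    client        -- a cluster client exposing connection_pool.nodes.initialize()
    command       -- method name on the client, e.g. "get"
    errors        -- exception types that should trigger a refresh (assumption:
                     pass the client's connection/cluster error classes here)
    max_refreshes -- how many times to re-initialize before giving up
    """
    refreshes = 0
    while True:
        try:
            return getattr(client, command)(*args)
        except errors:
            if refreshes >= max_refreshes:
                raise
            # Rebuild the cached slot-to-node mapping, then retry the command.
            client.connection_pool.nodes.initialize()
            refreshes += 1
```

A call would look like `call_with_refresh(rc, "get", "some-key", errors=(SomeClusterError,))`, where `rc` and `SomeClusterError` are placeholders. Note that this only helps once an error surfaces to the caller; commands that succeed slowly after internal retries would not trigger the refresh.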

Is this a bug, and how can it be fixed?

Thanks.

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 8 (5 by maintainers)

Top GitHub Comments

1 reaction
Grokzen commented, Sep 20, 2019

In my view, it is “kinda” expected that clients need a few seconds, if not more, during an actual failover before they find their way back to the right master and cluster setup. There is not much that can be done to the detection and testing algorithm that tries to find the new cluster layout after one node fails. I do agree that step 3 could be optimized a bit so that, when a master goes down, the client does not try the old master again.

The idea of the cluster detection algorithm, though, is that you provide a stable set of startup nodes, at least one of which you expect to be inside the cluster and to carry the correct cluster state. There is some case for also falling back to the nodes found during discovery, for example in a long-lived cluster where nodes are migrated on a semi-regular basis and the node setup no longer looks the same six months later as when the client first started. I am not sure I want to change this implementation, since both the reference implementation and most other clients find their way back to the cluster through the startup_nodes when a node fails.
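
Following the point about a stable set of startup nodes, here is a minimal sketch of listing every known node when the client is created. `parse_startup_nodes` and `make_client` are hypothetical helpers; the dict shape with string ports follows the redis-py-cluster README, and `make_client` assumes the `rediscluster` package is installed and a cluster is reachable.

```python
def parse_startup_nodes(addresses):
    """Turn "host:port" strings into the startup_nodes list of dicts."""
    nodes = []
    for address in addresses:
        host, _, port = address.rpartition(":")
        if not host or not port.isdigit():
            raise ValueError(f"expected host:port, got {address!r}")
        nodes.append({"host": host, "port": port})
    return nodes


def make_client(addresses):
    """Create a cluster client seeded with all given nodes."""
    # Imported lazily so the helper above stays usable without the package.
    from rediscluster import RedisCluster

    return RedisCluster(
        startup_nodes=parse_startup_nodes(addresses),
        decode_responses=True,
    )


if __name__ == "__main__":
    # Both nodes from this issue, so either one can serve as an entry point.
    print(parse_startup_nodes(["172.28.10.29:6379", "172.28.10.30:6379"]))
```

Listing both nodes means that rediscovery still has one reachable entry point when the other is down. Whether it changes the retry behaviour described in this issue is not confirmed here.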

0 reactions
Grokzen commented, Sep 6, 2020

Closing this issue due to inactivity. If the issue still persists in the RC release tag 2.0.99, @summer-zt, then please open a new issue with new tests from your side showing that the error still occurs after the fixes added to the next major release.
