Sentinel failover not detected when connection hangs

See original GitHub issue

I have stumbled onto the same issue as these:

After some investigation, I concluded that ioredis currently relies on Redis closing the connection, as described here.

However, when the failover is initiated with the Redis DEBUG SLEEP command or docker pause, the connection simply hangs, but doesn’t terminate.

This could be solved by subscribing to sentinel messages on the +switch-master channel, which the Sentinel docs describe as “the message most external users are interested in”.
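For reference, the Sentinel docs give the +switch-master payload as `<master name> <oldip> <oldport> <newip> <newport>`. A minimal TypeScript sketch of parsing that payload could look like the following. The names parseSwitchMaster and SwitchMasterEvent are hypothetical and only for illustration; they are not part of ioredis.

```typescript
// Hypothetical shape for a parsed +switch-master payload (not an ioredis type).
interface SwitchMasterEvent {
  masterName: string;
  oldHost: string;
  oldPort: number;
  newHost: string;
  newPort: number;
}

// Parses "<master name> <oldip> <oldport> <newip> <newport>".
// Returns null when the payload does not have exactly five fields
// or when either port is not an integer.
function parseSwitchMaster(message: string): SwitchMasterEvent | null {
  const parts = message.trim().split(/\s+/);
  if (parts.length !== 5) return null;

  const [masterName, oldHost, oldPortRaw, newHost, newPortRaw] = parts;
  const oldPort = Number(oldPortRaw);
  const newPort = Number(newPortRaw);
  if (!Number.isInteger(oldPort) || !Number.isInteger(newPort)) return null;

  return { masterName, oldHost, oldPort, newHost, newPort };
}
```

Checking masterName before acting matters when one sentinel monitors several masters, since the channel carries events for all of them.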

I’ve created a reproducible example at https://github.com/mjomble/ioredis-sentinel-issue that listens for the message outside ioredis. Once the message is received, it uses internal/undocumented fields to call client.connector.stream.destroy(), because redis.disconnect(true) (which calls stream.end()) leaves the connection open in this scenario.
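The workaround can be sketched roughly as below. This is not the code from the linked repository. It is a simplified illustration written against small structural types so it does not depend on a particular ioredis version. It assumes `sentinel` is a separate connection to a sentinel that can enter subscriber mode, and that `client.connector.stream` is the internal, undocumented field mentioned above, which may change between releases. The function name destroyOnSwitchMaster is hypothetical.

```typescript
// Only the members this sketch touches; an ioredis connection to a sentinel
// is assumed to provide subscribe() and the "message" event.
interface SentinelSubscriber {
  subscribe(channel: string): Promise<unknown>;
  on(event: "message", listener: (channel: string, message: string) => void): unknown;
}

// Internal/undocumented shape referenced in the issue. With a real client
// a type cast will probably be needed to reach these fields.
interface ClientWithConnector {
  connector?: { stream?: { destroy(): void } | null };
}

// Hypothetical helper: destroys the hanging socket when the watched master
// is switched, so the client's normal reconnect logic can take over.
function destroyOnSwitchMaster(
  sentinel: SentinelSubscriber,
  client: ClientWithConnector,
  masterName: string
): Promise<unknown> {
  sentinel.on("message", (channel, message) => {
    if (channel !== "+switch-master") return;
    // The payload starts with the master name; ignore other masters.
    if (message.split(" ")[0] !== masterName) return;
    // destroy() drops the socket immediately, whereas end() waits for the
    // peer, which never answers while the old master is frozen.
    client.connector?.stream?.destroy();
  });
  return sentinel.subscribe("+switch-master");
}
```

Registering the listener before subscribing avoids missing a message that arrives right after the subscription is confirmed.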

Ideally, this could all happen inside ioredis. I could probably submit a PR if needed.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 1
  • Comments: 12 (7 by maintainers)

Top GitHub Comments

1 reaction
luin commented, Mar 31, 2021

Yeah it’s more complex than I thought. I’d go with subscribing to all sentinels because in case of a partition, as you said, it’s very likely that the connected sentinel and the old master are in the same network, so we won’t be able to get events from that sentinel.

As for implementation details, I think SentinelConnector will get a lastActiveSentinel property, which defaults to null. Once a node is resolved, the connector will subscribe to all sentinels provided by the user (I don’t think it needs to be dynamic in v1, as that seems non-trivial to implement). Not sure if it’s necessary, but a reasonable connection count limit could be applied in case the user provides too many sentinels.

When a +switch-master is received, we set lastActiveSentinel to the one that got the event, and disconnect so Redis#connect() will kick in. Next time, SentinelConnector will try lastActiveSentinel first (and then reset it). Wdyt?
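To make the proposed ordering concrete, here is a small TypeScript sketch of the "try lastActiveSentinel first, then reset it" bookkeeping. It is an illustration of the idea only, not the ioredis implementation. SentinelOrder, noteSwitchMasterFrom, and nextAttemptOrder are hypothetical names, and the subscriptions and the disconnect itself are left out.

```typescript
interface SentinelAddress {
  host: string;
  port: number;
}

// Hypothetical helper that tracks which sentinel should be tried first.
class SentinelOrder {
  // Defaults to null, as in the proposal.
  private lastActiveSentinel: SentinelAddress | null = null;
  private readonly sentinels: SentinelAddress[];

  constructor(sentinels: SentinelAddress[]) {
    this.sentinels = [...sentinels];
  }

  // Call when `sentinel` delivered a +switch-master event.
  noteSwitchMasterFrom(sentinel: SentinelAddress): void {
    this.lastActiveSentinel = sentinel;
  }

  // Order in which to query sentinels on the next connect attempt.
  // The remembered sentinel goes first and is then forgotten.
  nextAttemptOrder(): SentinelAddress[] {
    const preferred = this.lastActiveSentinel;
    this.lastActiveSentinel = null;
    if (preferred === null) return [...this.sentinels];

    const rest = this.sentinels.filter(
      (s) => !(s.host === preferred.host && s.port === preferred.port)
    );
    return [preferred, ...rest];
  }
}
```

Resetting after one use means a sentinel that has since become unreachable is not preferred forever.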

@ohadisraeli @leibale btw, do you have any input on the correct behavior here, i.e. whether clients should subscribe to all sentinels or not? Or maybe it should be behind an option so users can enable/disable it?

0 reactions
mjomble commented, Apr 14, 2022

“I think the subscription should be reset to another sentinel if we have a failover.”

I agree

“From my view I think it’s related. Maybe I’m wrong, you can tell me.”

It is certainly related. And if the root cause of your problem is that failover detection fails, then it is also the exact same issue.

If, however, it turns out that failover is successfully detected and the problem is that ioredis does not perform the necessary additional actions after successfully detecting a failover, then I would say it is a related, but separate issue.

So my recommendation is to first try and find out if failover is actually detected or not.
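One way to check this is to log the client's connection lifecycle events while triggering a failover. The sketch below assumes the client emits the documented ioredis events connect, ready, error, close, reconnecting, and end. traceConnection is a hypothetical helper name. If no close or reconnecting line shows up after the failover, that would be consistent with the hanging connection described in this issue.

```typescript
// Minimal structural type: anything with an EventEmitter-style on().
interface ConnectionEvents {
  on(event: string, listener: (...args: unknown[]) => void): unknown;
}

// Connection lifecycle events documented by ioredis.
const LIFECYCLE_EVENTS = ["connect", "ready", "error", "close", "reconnecting", "end"];

// Hypothetical helper: writes one timestamped line per lifecycle event.
function traceConnection(
  client: ConnectionEvents,
  log: (line: string) => void = console.log
): void {
  for (const event of LIFECYCLE_EVENTS) {
    client.on(event, () => log(`${new Date().toISOString()} ${event}`));
  }
}
```

Attach it right after creating the client, then compare the timestamps with the sentinel log to see whether the client noticed the switch at all.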

