Sentinel failover not detected when connection hangs
I have stumbled onto the same issue as these:
- https://github.com/luin/ioredis/issues/556
- https://github.com/luin/ioredis/issues/1021
- https://github.com/luin/ioredis/issues/1059
After some investigating, I concluded that ioredis currently relies on Redis closing the connection as described here.
However, when the failover is initiated with the Redis `DEBUG SLEEP` command or `docker pause`, the connection simply hangs, but doesn’t terminate.
This could be solved by subscribing to sentinel messages on the `+switch-master` channel, described in the Sentinel docs as “the message most external users are interested in”.
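To illustrate what a subscriber would receive, here is a minimal sketch of parsing a `+switch-master` payload. According to the Sentinel docs, the payload has the form `<master-name> <old-ip> <old-port> <new-ip> <new-port>`. The `SwitchMasterEvent` type and the `parseSwitchMaster` helper are hypothetical names used for illustration only; they are not part of ioredis.

```typescript
// Hypothetical shape for a parsed +switch-master event (not an ioredis type).
interface SwitchMasterEvent {
  masterName: string;
  oldHost: string;
  oldPort: number;
  newHost: string;
  newPort: number;
}

// Parses "<master-name> <old-ip> <old-port> <new-ip> <new-port>".
// Returns null when the payload does not have exactly five fields
// or when either port is not an integer.
function parseSwitchMaster(message: string): SwitchMasterEvent | null {
  const parts = message.trim().split(/\s+/);
  if (parts.length !== 5) {
    return null;
  }
  const [masterName, oldHost, oldPortRaw, newHost, newPortRaw] = parts;
  const oldPort = Number(oldPortRaw);
  const newPort = Number(newPortRaw);
  if (!Number.isInteger(oldPort) || !Number.isInteger(newPort)) {
    return null;
  }
  return { masterName, oldHost, oldPort, newHost, newPort };
}
```

In practice, a separate ioredis connection pointed directly at a sentinel could call `subscribe('+switch-master')` and pass each `message` event payload to a helper like this one.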
I’ve created a reproducible example: https://github.com/mjomble/ioredis-sentinel-issue
This example listens to the message outside ioredis.
Once received, it uses internal/undocumented fields to call `client.connector.stream.destroy()` because `redis.disconnect(true)` (which calls `stream.end()`) leaves the connection open in this scenario.
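The workaround can be sketched roughly as follows. This is not the code from the linked repository; `MessageSource`, `Destroyable`, and `destroyOnSwitchMaster` are hypothetical names, and the interfaces are kept structural so the sketch does not depend on ioredis typings.

```typescript
// Anything that emits ("message", channel, payload), such as a
// subscriber connection to a sentinel. Hypothetical interface.
interface MessageSource {
  on(
    event: "message",
    listener: (channel: string, message: string) => void
  ): unknown;
}

// Anything with a destroy() method, such as a Node.js socket.
interface Destroyable {
  destroy(): void;
}

// Destroys the current stream when the given master is switched.
// getStream is called lazily because the stream object may be
// replaced on every reconnect.
function destroyOnSwitchMaster(
  sentinelSubscriber: MessageSource,
  masterName: string,
  getStream: () => Destroyable | undefined
): void {
  sentinelSubscriber.on("message", (channel, message) => {
    if (channel !== "+switch-master") {
      return;
    }
    // The payload starts with the master name.
    if (message.split(" ")[0] !== masterName) {
      return;
    }
    const stream = getStream();
    if (stream !== undefined) {
      stream.destroy();
    }
  });
}
```

With ioredis, the third argument would be something like `() => (client as any).connector.stream`, which relies on the same internal/undocumented fields mentioned above and may break between versions.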
Ideally, this could all happen inside ioredis. I could probably submit a PR if needed.
Issue Analytics
- Created 2 years ago
- Reactions:1
- Comments:12 (7 by maintainers)
Top GitHub Comments
Yeah it’s more complex than I thought. I’d go with subscribing to all sentinels because in case of a partition, as you said, it’s very likely that the connected sentinel and the old master are in the same network, so we won’t be able to get events from that sentinel.
As for implementation details, I think SentinelConnector will get a lastActiveSentinel property, which defaults to null. Once a node is resolved, the connector will subscribe to all sentinels provided by the user (I don’t think it needs to be dynamic in v1, as that seems non-trivial to implement). Not sure if it’s necessary, but a reasonable connection count limit may be applied in case the user provides too many sentinels.
When a +switch-master is received, we set lastActiveSentinel to the one that got the event, and disconnect so Redis#connect() will kick in. Next time SentinelConnector will try lastActiveSentinel (and then reset it) first. Wdyt?
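The proposed ordering could look something like the sketch below. It only models the "try lastActiveSentinel first, then reset it" part, not the subscriptions or the reconnect itself. `SentinelOrder`, `onSwitchMaster`, and `nextOrder` are hypothetical names, not existing ioredis APIs.

```typescript
interface SentinelAddress {
  host: string;
  port: number;
}

// Tracks which sentinel last reported a +switch-master and puts it
// first in the next resolution attempt. Hypothetical helper.
class SentinelOrder {
  private readonly sentinels: SentinelAddress[];
  private lastActiveSentinel: SentinelAddress | null = null;

  constructor(sentinels: SentinelAddress[]) {
    this.sentinels = sentinels;
  }

  // Called when a sentinel delivers a +switch-master event.
  onSwitchMaster(from: SentinelAddress): void {
    this.lastActiveSentinel = from;
  }

  // Returns the sentinels in the order they should be tried next,
  // then resets lastActiveSentinel so it is preferred only once.
  nextOrder(): SentinelAddress[] {
    const preferred = this.lastActiveSentinel;
    this.lastActiveSentinel = null;
    if (preferred === null) {
      return [...this.sentinels];
    }
    const rest = this.sentinels.filter(
      (s) => !(s.host === preferred.host && s.port === preferred.port)
    );
    return [preferred, ...rest];
  }
}
```

For example, with sentinels a, b, and c configured and an event received from c, the next attempt would try c, a, b, and the attempt after that would fall back to a, b, c.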
@ohadisraeli @leibale btw do you have any input on the correct behavior here, i.e. whether clients should subscribe to all sentinels or not? Or maybe it should be behind an option so users can enable/disable it?
I agree
It is certainly related. And if the root cause of your problem is that failover detection fails, then it is also the exact same issue.
If, however, it turns out that failover is successfully detected and the problem is that ioredis does not perform the necessary additional actions after successfully detecting a failover, then I would say it is a related, but separate issue.
So my recommendation is to first try and find out if failover is actually detected or not.