Lettuce cannot recover from connection problems
Bug Report
Current Behavior
During troubleshooting of our production issues with Lettuce and Redis Cluster, we have discovered issues with re-connection of Pub/Sub subscriptions after network problems.
Lettuce does not send any keep-alive packets on TCP connections dedicated to Pub/Sub subscriptions. Without keep-alives, in the rare case of a sudden connection loss to a Redis node, Lettuce cannot detect that the connection is no longer working. With the default OS configuration, it waits for hours until the OS closes the connection. In the meantime, all messages published to the channel are lost.
Input Code
Minimal code from the Lettuce docs is enough to reproduce the issue.
RedisClusterClient clusterClient = RedisClusterClient.create(Arrays.asList(node1, node2, node3));
ClusterTopologyRefreshOptions topologyRefreshOptions = ClusterTopologyRefreshOptions.builder()
    .enablePeriodicRefresh(Duration.ofSeconds(15))
    .enableAllAdaptiveRefreshTriggers()
    .build();
clusterClient.setOptions(ClusterClientOptions.builder()
    .topologyRefreshOptions(topologyRefreshOptions)
    .build());
StatefulRedisPubSubConnection<String, String> connection = clusterClient.connectPubSub();
connection.addListener(new RedisPubSubListener<String, String>() { ... } );
RedisPubSubCommands<String, String> sync = connection.sync();
sync.subscribe("broadcast");
To reproduce the issue:
- Start Redis Cluster.
- Connect to the cluster and subscribe to the channel using the above code.
- Find which server the client is connected to, using tcpdump or by checking with redis-cli PUBSUB CHANNELS *.
- Block all network traffic on that server using iptables (killing the Redis process is not enough: the OS will send FIN packets, and Lettuce will detect the problem and recover the subscription).
- Redis Cluster will recover by promoting one of the replicas to master.
- Lettuce will not detect that the connection is no longer working, and won’t receive messages published to channels. The unused connection will be closed by the OS after a couple of hours, and only then might Lettuce be able to fix the problem.
We’ve also been able to reproduce the issue with Redis Standalone:
- Connect to Pub/Sub using Lettuce.
- Block traffic on the master using iptables. Restart the VM running Redis and restore traffic.
- Lettuce does not detect the issue and keeps listening on a dead connection.
Expected behavior/code
Lettuce should be able to detect a broken connection to fix Pub/Sub subscriptions.
Environment
- Lettuce version(s): 5.3.4.RELEASE
- Redis version: 5.0.5
Possible Solution
We’ve run similar tests using the redis-cli client. The official client sends keep-alive packets every 15 seconds and is able to detect connection loss.
It would be best if Lettuce could send keep-alive packets on a Pub/Sub connection to detect network problems. That should enable Lettuce to fix Pub/Sub subscriptions.
Workarounds
We’ve found a workaround for this problem by tweaking OS params (tcp_keepalive_time, tcp_keepalive_intvl, tcp_keepalive_probes), but we would want to avoid changing OS params on all our machines that use Lettuce as a Redis client.
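For reference, the time the kernel needs to declare an idle connection dead can be estimated from those three parameters as tcp_keepalive_time + tcp_keepalive_intvl × tcp_keepalive_probes. A small sketch of that arithmetic follows. KeepAliveMath is a hypothetical helper, 7200/75/9 are the usual Linux defaults, the tuned values are only an illustration, and the estimate applies only to idle sockets that have SO_KEEPALIVE enabled:

```java
import java.time.Duration;

final class KeepAliveMath {

    /**
     * Approximate time until the kernel drops an idle, unreachable peer:
     * the idle period before the first probe, plus one interval per unanswered probe.
     */
    static Duration worstCaseDetection(int idleSeconds, int intervalSeconds, int probes) {
        return Duration.ofSeconds(idleSeconds + (long) intervalSeconds * probes);
    }

    public static void main(String[] args) {
        // Usual Linux defaults: tcp_keepalive_time=7200, tcp_keepalive_intvl=75, tcp_keepalive_probes=9
        System.out.println(worstCaseDetection(7200, 75, 9)); // prints PT2H11M15S
        // Illustrative tuned values (assumption): 15 s idle, 5 s interval, 3 probes
        System.out.println(worstCaseDetection(15, 5, 3));    // prints PT30S
    }
}
```

With the defaults this comes to a little over two hours, which matches the "hours" observed above.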
Issue Analytics
- Created 3 years ago
- Reactions:3
- Comments:13 (3 by maintainers)
Top GitHub Comments
Thank you for the hint.
I’ve managed to fix the problem by adding netty-transport-native-epoll to the classpath and configuring Netty:
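The snippet attached to this comment is not preserved here. As a rough sketch of what such a configuration can look like, assuming Lettuce 5.x's NettyCustomizer hook and Netty's EpollChannelOption constants: the class name KeepAliveClientFactory and the timing values (15 s idle, 5 s interval, 3 probes) are illustrative assumptions, not necessarily the settings used in the comment.

```java
import io.lettuce.core.RedisURI;
import io.lettuce.core.SocketOptions;
import io.lettuce.core.cluster.ClusterClientOptions;
import io.lettuce.core.cluster.RedisClusterClient;
import io.lettuce.core.resource.ClientResources;
import io.lettuce.core.resource.NettyCustomizer;
import io.netty.bootstrap.Bootstrap;
import io.netty.channel.epoll.Epoll;
import io.netty.channel.epoll.EpollChannelOption;

// Hypothetical helper: builds a cluster client whose sockets use per-connection keep-alive settings.
final class KeepAliveClientFactory {

    static RedisClusterClient create(Iterable<RedisURI> nodes) {
        ClientResources resources = ClientResources.builder()
            .nettyCustomizer(new NettyCustomizer() {
                @Override
                public void afterBootstrapInitialized(Bootstrap bootstrap) {
                    // These options exist only on the native epoll transport (Linux).
                    if (!Epoll.isAvailable()) {
                        return;
                    }
                    bootstrap.option(EpollChannelOption.TCP_KEEPIDLE, 15); // seconds idle before the first probe
                    bootstrap.option(EpollChannelOption.TCP_KEEPINTVL, 5); // seconds between probes
                    bootstrap.option(EpollChannelOption.TCP_KEEPCNT, 3);   // unanswered probes before the socket is dropped
                }
            })
            .build();

        RedisClusterClient client = RedisClusterClient.create(resources, nodes);
        client.setOptions(ClusterClientOptions.builder()
            // SO_KEEPALIVE must be on for the TCP_KEEP* options to have any effect.
            .socketOptions(SocketOptions.builder().keepAlive(true).build())
            .build());
        return client;
    }
}
```

Setting the options per socket this way avoids changing tcp_keepalive_* system-wide. When combining this with the topology refresh options from the report, both have to go into the same ClusterClientOptions builder, because setOptions replaces the previous options.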
I also filed an issue with Redis: https://github.com/redis/redis/issues/7855 because we think this should be documented a little better. Without the above code, Pub/Sub does not work correctly after network issues. It was quite challenging to reproduce and troubleshoot this issue.
Forgot to update: we finally fixed this by adding a TCP_USER_TIMEOUT as well (i.e. a socket timeout). The final add-on code looks something like this:
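The code from this comment is also missing here. A minimal sketch of the TCP_USER_TIMEOUT part, again assuming the epoll transport and Lettuce's NettyCustomizer: UserTimeoutResources and the timeout passed in are hypothetical. The option limits how long written data may stay unacknowledged before the kernel closes the socket, a case keep-alive probes alone do not cover.

```java
import io.lettuce.core.resource.ClientResources;
import io.lettuce.core.resource.NettyCustomizer;
import io.netty.bootstrap.Bootstrap;
import io.netty.channel.epoll.Epoll;
import io.netty.channel.epoll.EpollChannelOption;

// Hypothetical helper: client resources that cap the time unacknowledged data may stay in flight.
final class UserTimeoutResources {

    static ClientResources create(int userTimeoutMillis) {
        return ClientResources.builder()
            .nettyCustomizer(new NettyCustomizer() {
                @Override
                public void afterBootstrapInitialized(Bootstrap bootstrap) {
                    // TCP_USER_TIMEOUT is Linux-specific and exposed only by the epoll transport.
                    if (Epoll.isAvailable()) {
                        // Value is in milliseconds.
                        bootstrap.option(EpollChannelOption.TCP_USER_TIMEOUT, userTimeoutMillis);
                    }
                }
            })
            .build();
    }
}
```

The resulting ClientResources would be passed to RedisClient.create(...) or RedisClusterClient.create(...), for example with a timeout of 60000 ms. The keep-alive options can be set in the same customizer.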
We have not had the “15 mins connection timeout issue” for over 7 days now; you can try it out as well and see if it works for you. Cheers!