Lettuce cannot recover from connection problems

Bug Report

Current Behavior

During troubleshooting of our production issues with Lettuce and Redis Cluster, we have discovered issues with re-connection of Pub/Sub subscriptions after network problems.

Lettuce does not send keep-alive packets on TCP connections dedicated to Pub/Sub subscriptions. Without keep-alives, in the rare case of a sudden connection loss to a Redis node, Lettuce cannot detect that the connection is no longer working. With the default OS configuration, it waits for hours until the OS closes the connection. In the meantime, all messages published to the channel are lost.

Input Code

Minimal code from Lettuce docs is enough to reproduce the issue.

        RedisClusterClient clusterClient = RedisClusterClient.create(Arrays.asList(node1, node2, node3));

        ClusterTopologyRefreshOptions topologyRefreshOptions = ClusterTopologyRefreshOptions.builder()
                .enablePeriodicRefresh(Duration.ofSeconds(15))
                .enableAllAdaptiveRefreshTriggers()
                .build();

        clusterClient.setOptions(ClusterClientOptions.builder()
                .topologyRefreshOptions(topologyRefreshOptions)
                .build());

        StatefulRedisPubSubConnection<String, String> connection = clusterClient.connectPubSub();
        connection.addListener(new RedisPubSubListener<String, String>() { ... } );

        RedisPubSubCommands<String, String> sync = connection.sync();
        sync.subscribe("broadcast");

To reproduce the issue:

  • Start Redis Cluster.
  • Connect to the cluster and subscribe to the channel using the above code.
  • Find which server the client is connected to using tcpdump or by checking with redis-cli PUBSUB CHANNELS *.
  • Block all network traffic on that server using iptables (killing the Redis process is not enough: the OS will send FIN packets, and Lettuce will detect the problem and recover the subscription).
  • Redis Cluster will recover by promoting one of the replicas to master.
  • Lettuce will not detect that the connection is no longer working and won’t receive messages published to channels. The unused connection will be closed by the OS after a couple of hours, and then Lettuce might be able to fix the problem.

We’ve also been able to reproduce the issue with Redis Standalone:

  • Connect to Pub/Sub using Lettuce.
  • Kill traffic on the master using iptables. Restart the VM running Redis and restore traffic.
  • Lettuce does not detect the issue and keeps listening on a dead connection.

Expected behavior/code

Lettuce should be able to detect a broken connection to fix Pub/Sub subscriptions.

Environment

  • Lettuce version(s): 5.3.4.RELEASE
  • Redis version: 5.0.5

Possible Solution

We’ve run similar tests using the redis-cli client. The official client sends keep-alive packets every 15 seconds and is able to detect connection loss.

It would be best if Lettuce could send keep-alive packets on a Pub/Sub connection to detect network problems. That should enable Lettuce to fix Pub/Sub subscriptions.
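
For illustration only, here is a minimal sketch of what per-connection keep-alive looks like on a plain JDK socket. It is not Lettuce code: Lettuce manages its sockets through Netty, and the class and method names below are hypothetical. It assumes Java 11 or later for the `jdk.net.ExtendedSocketOptions` constants, and it applies the timing options only when the platform reports them as supported.

```java
import java.io.IOException;
import java.net.Socket;
import java.net.SocketOption;
import java.net.StandardSocketOptions;
import java.util.Set;
import jdk.net.ExtendedSocketOptions;

public class KeepAliveSocketSketch {

    /**
     * Creates an unconnected socket with SO_KEEPALIVE enabled and, where the
     * platform supports them, per-socket probe timings instead of OS-wide defaults.
     */
    static Socket newKeepAliveSocket(int idleSeconds, int intervalSeconds, int probeCount)
            throws IOException {
        Socket socket = new Socket(); // caller connects it afterwards
        socket.setOption(StandardSocketOptions.SO_KEEPALIVE, true);

        Set<SocketOption<?>> supported = socket.supportedOptions();
        if (supported.contains(ExtendedSocketOptions.TCP_KEEPIDLE)) {
            // Idle time in seconds before the first probe is sent.
            socket.setOption(ExtendedSocketOptions.TCP_KEEPIDLE, idleSeconds);
        }
        if (supported.contains(ExtendedSocketOptions.TCP_KEEPINTERVAL)) {
            // Seconds between consecutive probes.
            socket.setOption(ExtendedSocketOptions.TCP_KEEPINTERVAL, intervalSeconds);
        }
        if (supported.contains(ExtendedSocketOptions.TCP_KEEPCOUNT)) {
            // Unanswered probes tolerated before the connection is dropped.
            socket.setOption(ExtendedSocketOptions.TCP_KEEPCOUNT, probeCount);
        }
        return socket;
    }

    public static void main(String[] args) throws IOException {
        try (Socket socket = newKeepAliveSocket(15, 5, 3)) {
            System.out.println("SO_KEEPALIVE enabled: "
                    + socket.getOption(StandardSocketOptions.SO_KEEPALIVE));
        }
    }
}
```

The point of the sketch is that keep-alive can be tuned per socket, so a client library can detect a dead Pub/Sub connection without any change to machine-wide settings.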

Workarounds

We’ve found a workaround for this problem by tweaking OS params (tcp_keepalive_time, tcp_keepalive_intvl, tcp_keepalive_probes), but we would want to avoid changing OS params on all our machines that use Lettuce as a Redis client.
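
As a rough guide to how these three parameters interact, the sketch below estimates the worst-case time before the kernel gives up on an idle, silently dead connection: the idle time plus one probe interval per unanswered probe. The class name is hypothetical, the first set of numbers assumes stock Linux defaults (7200 s, 75 s, 9 probes), and the second set (15 s, 5 s, 3 probes) is only an example. Keep-alive must also be enabled on the socket for any of this to apply.

```java
import java.time.Duration;

public class KeepAliveDetectionTime {

    /**
     * Approximate worst-case detection time:
     * tcp_keepalive_time + tcp_keepalive_intvl * tcp_keepalive_probes.
     */
    static Duration worstCase(Duration idle, Duration interval, int probes) {
        return idle.plus(interval.multipliedBy(probes));
    }

    public static void main(String[] args) {
        // Assumed stock Linux defaults.
        Duration defaults = worstCase(Duration.ofSeconds(7200), Duration.ofSeconds(75), 9);
        // Example tuned values; pick numbers that suit your environment.
        Duration tuned = worstCase(Duration.ofSeconds(15), Duration.ofSeconds(5), 3);

        System.out.println("Stock defaults: " + defaults.getSeconds() + " s");
        System.out.println("Tuned values:   " + tuned.getSeconds() + " s");
    }
}
```

With the assumed defaults this comes to 7875 seconds (a little over two hours), which matches the "hours" observed above; with the example tuned values it drops to 30 seconds.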

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Reactions: 3
  • Comments: 13 (3 by maintainers)

Top GitHub Comments

6 reactions
adrianpasternak commented, Sep 28, 2020

Thank you for the hint.

I’ve managed to fix the problem by adding netty-transport-native-epoll to the classpath and configuring Netty:

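import io.lettuce.core.ClientOptions;
import io.lettuce.core.RedisClient;
import io.lettuce.core.SocketOptions;
import io.lettuce.core.resource.ClientResources;
import io.lettuce.core.resource.NettyCustomizer;
import io.netty.bootstrap.Bootstrap;
import io.netty.channel.epoll.EpollChannelOption;

// Enable SO_KEEPALIVE on the connection
SocketOptions socketOptions = SocketOptions.builder()
	.keepAlive(true)
	.build();

// The epoll options below only apply with the native epoll transport (Linux)
ClientResources clientResources = ClientResources.builder()
	.nettyCustomizer(new NettyCustomizer() {
		@Override
		public void afterBootstrapInitialized(Bootstrap bootstrap) {
			bootstrap.option(EpollChannelOption.TCP_KEEPIDLE, 15);  // seconds idle before the first probe
			bootstrap.option(EpollChannelOption.TCP_KEEPINTVL, 5);  // seconds between probes
			bootstrap.option(EpollChannelOption.TCP_KEEPCNT, 3);    // unanswered probes before giving up
		}
	})
	.build();

RedisClient client = RedisClient.create(clientResources, node);
// setOptions expects ClientOptions, so wrap the socket options
client.setOptions(ClientOptions.builder().socketOptions(socketOptions).build());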

I also submitted a bug to Redis: https://github.com/redis/redis/issues/7855 because we think it should be documented a little better. Without the above code, Pub/Sub will work incorrectly after network issues. It was quite challenging to reproduce and troubleshoot this issue.

1 reaction
NgSekLong commented, Oct 25, 2021

Forgot to update: we finally fixed this by adding TCP_USER_TIMEOUT as well (i.e. a socket timeout).

The final add-on code looks something like this:

ClientResources clientResources = ClientResources.builder()
  .nettyCustomizer(new NettyCustomizer() {
    @Override
    public void afterBootstrapInitialized(Bootstrap bootstrap) {
      bootstrap.option(EpollChannelOption.TCP_KEEPIDLE, 15);
      bootstrap.option(EpollChannelOption.TCP_KEEPINTVL, 5);
      bootstrap.option(EpollChannelOption.TCP_KEEPCNT, 3);
      // Socket Timeout (milliseconds)
      bootstrap.option(EpollChannelOption.TCP_USER_TIMEOUT, 60000);
    }
  })
  .build();
// Enable keep-alive
SocketOptions socketOptions = SocketOptions.builder()
  .keepAlive(true)
  .build();
ClientOptions clientOptions = ClientOptions.builder()
  .socketOptions(socketOptions)
  .build();

We have not had the “15 mins connection timeout issue” for over 7 days now. You can try it out as well and see if it works for you. Cheers!
