Add support for disconnect on timeout to recover early from no `RST` packet failures

Bug Report

I’m one of the Jedis Reviewers and our customers are experiencing unrecoverable issues with Lettuce in production.

Lettuce connects to a Redis host and reads and writes normally. However, if the host fails abruptly (for example, a hardware fault shuts the machine down and no RST is sent to the client), the client keeps timing out until TCP retransmission gives up, and only then can it recover. On Linux this takes about 925.6 s (see tcp_retries2).

          set k v
client ------------------> redis

          redis server down, no RST

          set k v (retransmit) 1
tcp    ------------------> redis (no reply)

          set k v (retransmit) 2
tcp    ------------------> redis (no reply)

          ... after 925.6 s

          RST
tcp    ------------------> redis

          reconnect
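
As a rough check on where a figure of this size comes from, the sketch below sums the retransmission waits under exponential backoff. It assumes the usual Linux defaults (tcp_retries2 = 15, a minimum RTO of 200 ms, a maximum RTO of 120 s); the class and method names are hypothetical and exist only for this illustration. Under those assumptions the sum is 924.6 s, which is the hypothetical value the Linux kernel documentation gives for the default tcp_retries2 and is close to the figure quoted above. Real connections start from a measured RTO, so actual times vary.

```java
/** Back-of-the-envelope estimate of how long TCP keeps retransmitting (illustrative only). */
final class RetransmitBudget {

    /**
     * Sums the waits for the original send plus each retransmission,
     * doubling the RTO each time and capping it at maxRtoMillis.
     */
    static long totalMillis(int retries, long initialRtoMillis, long maxRtoMillis) {
        long total = 0;
        long rto = initialRtoMillis;
        for (int i = 0; i <= retries; i++) {
            total += Math.min(rto, maxRtoMillis);
            rto = Math.min(rto * 2, maxRtoMillis);
        }
        return total;
    }

    public static void main(String[] args) {
        // Assumed defaults: tcp_retries2 = 15, RTO min 200 ms, RTO max 120 s.
        long millis = totalMillis(15, 200, 120_000);
        System.out.println(millis / 1000.0 + " s");
    }
}
```

The first ten waits double from 0.2 s to 102.4 s (204.6 s in total) and the remaining six are capped at 120 s each (720 s), so almost all of the delay comes from the capped tail.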

Why KeepAlive doesn’t fix this

Lettuce has supported configuring TCP keepalive since version 6.1.0 (see https://github.com/lettuce-io/lettuce-core/issues/1437).

Retransmission takes priority over keepalive: as long as unacknowledged data is still being retransmitted, the connection never reaches the keepalive stage, so TCP keeps retransmitting until the connection is finally torn down and re-established.

In which scenarios does this problem occur?

  • In most cases, when the operating system is shut down or the process exits, an RST is returned to the client. No RST is returned, however, when power is cut or certain hardware fails.
  • In cloud environments an SLB (server load balancer) usually sits in front of Redis. If a backend host fails and the SLB does not support connection draining, the same problem occurs.

How to reproduce this issue

  1. Start a Redis on a certain port, let’s say 6379, and use the following code to connect to Redis.
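        // Fragment of the test program; args[1] suggests it runs inside main(String[] args).
        // host, port, timeout, times, socketOptions, autoReconnect, disconnectedBehavior
        // and LOGGER are defined elsewhere in that program.
        // Imports needed: java.time.Duration, io.lettuce.core.ClientOptions,
        // io.lettuce.core.RedisClient, io.lettuce.core.RedisURI,
        // io.lettuce.core.api.sync.RedisCommands.
        RedisClient client = RedisClient.create(RedisURI.Builder.redis(host, port).withPassword(args[1])
            .withTimeout(Duration.ofSeconds(timeout)).build());

        client.setOptions(ClientOptions.builder()
            .socketOptions(socketOptions)
            .autoReconnect(autoReconnect)
            .disconnectedBehavior(disconnectedBehavior)
            .build());

        RedisCommands<String, String> sync = client.connect().sync();

        // Issue one SET per second and log either the reply or the failure.
        for (int i = 0; i < times; i++) {
            Thread.sleep(1000);

            try {
                LOGGER.info("{}:{}", i, sync.set("" + i, "" + i));
            } catch (Exception e) {
                LOGGER.error("Set Exception: {}", e.getMessage());
            }
        }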
  2. Use iptables on the Redis machine to drop all packets to and from port 6379.
        iptables -A INPUT -p tcp --dport 6379 -j DROP
        iptables -A OUTPUT -p tcp --sport 6379 -j DROP
  3. Observe that the client starts timing out and does not recover until about 925.6 s later (governed by tcp_retries2).
  4. After the test, clear the iptables rules.
        iptables -F INPUT
        iptables -F OUTPUT

How to fix this

We should provide an application-layer keepalive (heartbeat) mechanism: periodically send a heartbeat packet over the underlying Netty connection, and if the heartbeat times out, have the client initiate a reconnect so that it recovers quickly.
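
A minimal sketch of that idea, using only the JDK, might look like the following. It is not Lettuce or Netty code: `HeartbeatMonitor`, its `ping` supplier (standing in for something like sending a Redis PING) and its `reconnect` callback are hypothetical names chosen for illustration, and the real feature would have to hook into the client's own connection handling.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

/** Hypothetical application-level heartbeat; illustrative only, not part of Lettuce. */
final class HeartbeatMonitor implements AutoCloseable {
    private final Supplier<CompletableFuture<?>> ping; // assumed: sends one heartbeat command
    private final Runnable reconnect;                  // assumed: drops and re-creates the connection
    private final Duration timeout;
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "heartbeat");
                t.setDaemon(true);
                return t;
            });

    HeartbeatMonitor(Supplier<CompletableFuture<?>> ping, Runnable reconnect, Duration timeout) {
        this.ping = ping;
        this.reconnect = reconnect;
        this.timeout = timeout;
    }

    /** Sends one heartbeat and returns true if the peer answered within the timeout. */
    boolean checkOnce() {
        try {
            ping.get().get(timeout.toMillis(), TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // No answer in time: assume the peer is silently gone and reconnect now
            // instead of waiting for TCP retransmission to give up.
            reconnect.run();
            return false;
        } catch (ExecutionException e) {
            // The heartbeat failed with an error; left to normal error handling here.
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Runs checkOnce repeatedly, waiting the given interval between checks. */
    void start(Duration interval) {
        scheduler.scheduleWithFixedDelay(this::checkOnce,
                interval.toMillis(), interval.toMillis(), TimeUnit.MILLISECONDS);
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

With an interval of a few seconds and a similar timeout, a silent host failure would be noticed within seconds rather than after the full retransmission period. The trade-off is one extra round trip per interval on every connection.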

How Jedis avoids this problem

Jedis uses a connection pool. When an API call times out, Jedis destroys the connection and obtains a new one from the pool, which avoids the problem described above.

Environment

  • Lettuce version(s): main branch
  • Redis version: unstable branch

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 17 (11 by maintainers)

Top GitHub Comments

4 reactions
yangbodong22011 commented, Apr 27, 2022

I think it makes sense to host such a feature (disconnect on timeout) within TimeoutOptions.

Agreed, but I think a better strategy is to reconnect after X consecutive timeouts (X = 1 by default). The reasons are as follows:

  1. Some users configure a very small timeout and see timeouts frequently. For them an isolated timeout is normal, whereas X consecutive timeouts more likely indicate an abnormal situation.
  2. Lettuce does not use a connection pool, so every new connection carries an overhead, which may not be acceptable for the users described in point 1.
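
The consecutive-timeout rule proposed in this comment could be sketched roughly as below. `ConsecutiveTimeoutTracker` and its methods are hypothetical names for illustration only; nothing here is an existing Lettuce API, and the thread did not settle on a design.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper that counts consecutive command timeouts (illustrative only). */
final class ConsecutiveTimeoutTracker {
    private final int threshold; // the "X" from the comment above
    private final AtomicInteger consecutive = new AtomicInteger();

    ConsecutiveTimeoutTracker(int threshold) {
        if (threshold < 1) {
            throw new IllegalArgumentException("threshold must be >= 1");
        }
        this.threshold = threshold;
    }

    /** Call when a command completes in time; any running streak is cleared. */
    void onSuccess() {
        consecutive.set(0);
    }

    /** Call when a command times out; returns true when the caller should reconnect. */
    boolean onTimeout() {
        if (consecutive.incrementAndGet() >= threshold) {
            consecutive.set(0); // start a fresh streak after signalling a reconnect
            return true;
        }
        return false;
    }
}
```

With a threshold of 1 this behaves like plain disconnect-on-timeout. A larger threshold tolerates the occasional slow command while still reacting quickly when every command in a row times out, which is the pattern a silently failed host produces.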

2 reactions
yangbodong22011 commented, Jul 1, 2022

@yangbodong22011 How’s the PR going 😃 We need this mechanism badly.

We are waiting for @mp911de to have time to process it; we don’t have a firm strategy yet.
