
Periodic IO stalls/pauses

See original GitHub issue

We’ve observed behavior in our production systems where all Lettuce operations will (seemingly without provocation) grind to a halt, then resume just as abruptly. We’ve directly measured “stalls” between 3 and 5 seconds, and can infer the existence of stalls shorter than 3s and longer than 10s from other metrics. Stalls appear to happen on the order of once every host-week (which adds up quickly when there are lots of hosts in play). Each stall generally results in a burst of RedisCommandTimeoutExceptions and a pileup of pending requests.

We’ve observed this behavior with both Lettuce 5.3.4 and 6.0.0 under Ubuntu 18.04. We have not yet tried 5.3.5 or 6.0.1, but (unless this is somehow a TCP_NODELAY issue) I haven’t spotted anything in the changelog that seems relevant.

Our setup is that we have tens of application servers, each of which has a default ClientResources instance shared between three RedisClusterClient instances. We have a default command timeout of 3 seconds. Our application servers have 8 physical cores, and so my understanding is that our IO thread pools have 8 threads. Our application servers live in AWS EC2 and communicate with three Redis clusters hosted by AWS ElastiCache:

  • A general cache cluster with two shards (each with a leader and replica) that mostly receives simple GET/SET commands of relatively consistent sizes
  • A metrics cluster with two shards (each with a leader and replica) that also mostly receives simple GET/SET commands
  • A notification cluster with 8 shards (one leader/replica per shard) that receives a relatively high volume of messages of varying sizes and also notifies listeners of new messages via keyspace notifications

Between all three clusters, application instances issue somewhere between 5,000 and 10,000 Redis commands per second.
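
For concreteness, here’s a minimal sketch of the client setup described above; the hostnames, class name, and programmatic style are illustrative rather than our actual configuration:

import io.lettuce.core.RedisURI;
import io.lettuce.core.cluster.RedisClusterClient;
import io.lettuce.core.resource.ClientResources;
import io.lettuce.core.resource.DefaultClientResources;

import java.time.Duration;

public class RedisClientSetup {

    public static void main(String[] args) {
        // One ClientResources instance shared by all three cluster clients; its IO
        // thread pool defaults to the number of available processors (8 on our hosts).
        ClientResources resources = DefaultClientResources.create();

        // Endpoints below are placeholders; the real clusters are AWS ElastiCache.
        RedisClusterClient cacheClient = RedisClusterClient.create(
                resources, RedisURI.create("cache.example.internal", 6379));
        RedisClusterClient metricsClient = RedisClusterClient.create(
                resources, RedisURI.create("metrics.example.internal", 6379));
        RedisClusterClient notificationClient = RedisClusterClient.create(
                resources, RedisURI.create("notifications.example.internal", 6379));

        // Default command timeout of 3 seconds on each client.
        cacheClient.setDefaultTimeout(Duration.ofSeconds(3));
        metricsClient.setDefaultTimeout(Duration.ofSeconds(3));
        notificationClient.setDefaultTimeout(Duration.ofSeconds(3));
    }
}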

We’ve instrumented instances by logging from io.lettuce.core.protocol at DEBUG and by observing TCP traffic. When we experience a stall, the symptoms are:

  • All traffic from the affected instance to ALL shards in ALL Redis clusters stops completely at a TCP level
  • Other application servers are unaffected and continue communicating with Redis nodes normally
  • Lettuce continues to log attempts to write commands, though the write attempts don’t resolve for several seconds and we see no corresponding outbound packets at a TCP level
  • Keyspace notifications from the notification cluster continue to arrive at the affected instance as expected and are ACKed at the TCP level even though they’re not processed at the application layer; I’m afraid I can’t tell from the Lettuce logs if they’re “seen” by Lettuce when they arrive or if that’s also delayed

A representative log line when attempting to write a command during a stall:

DEBUG [2020-10-30 19:19:48,265] io.lettuce.core.protocol.DefaultEndpoint: [channel=0x5154e1c6, /10.0.1.170:34672 -> /10.0.0.109:6379, epid=0x49] write() writeAndFlush command ClusterCommand [command=AsyncCommand [type=GET, output=ValueOutput [output=null, error='null'], commandType=io.lettuce.core.protocol.Command], redirections=0, maxRedirections=5]

…which will resolve several seconds later:

DEBUG [2020-10-30 19:19:50,184] io.lettuce.core.protocol.DefaultEndpoint: [channel=0x5154e1c6, /10.0.1.170:34672 -> /10.0.0.109:6379, epid=0x49] write() done

We do not see any reads reported by Lettuce during a stall, though we do see them before and after.

I have not yet found anything in our application logs, Lettuce’s logs, or TCP events that predicts when one of these stalls will occur. We will often see TCP retransmissions or “zero window” packets during or after a stall, but have not yet seen a consistent pattern prior to the stall. We have not observed any reconnections or cluster topology changes in our own logs, and to the best of my knowledge, that’s supported by Lettuce’s own logging because channel IDs remain consistent before and after a stall.

I do usually see an increase in allocated direct memory in the minute or two before one of these stalls, but we also see increases in allocated direct memory that do not precede a stall.
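
For context, the “allocated direct memory” figure comes from periodic sampling. One way to read that number (not necessarily how our metrics pipeline does it) is the JDK’s BufferPoolMXBean:

import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;

// Samples the JVM's "direct" buffer pool; Netty's pooled allocator also exposes
// its own metrics, so this is just one way to arrive at a direct-memory number.
public class DirectMemorySampler {

    public static long directMemoryUsedBytes() {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1; // "direct" pool not found
    }
}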

I’m aware that similar issues have been reported in the past, but they have generally been closed due to a lack of debug information. My next step is to capture some thread dumps when these stalls occur. In the meantime, I think we can address some hypotheses with reasonable (but not perfect) confidence:

  • I would be surprised if something in our application space were blocking IO threads because:
    • We do not issue any blocking Redis commands
    • With the exception of keyspace notifications, we do not call any commands that result in a callback from Lettuce into our code
    • Keyspace notifications are immediately dispatched to a separate ExecutorService to avoid blocking Lettuce/Netty IO threads (a sketch of this handoff follows the list below)
    • In keeping with our application’s underlying framework, we use sync() commands for virtually everything
    • It seems unlikely that we’d manage to block all IO threads simultaneously
    • The duration of a stall is not closely correlated with our command timeout (i.e. a command timing out does not appear to unclog things)
  • This does not appear to be correlated with GC activity; GC time is consistent before, during, and after a stall
  • The direct memory allocation clue may point to us sending an abnormally large (several megabytes?) message through our notification cluster, but it’s not clear to me how that would block all other traffic; I also can’t find a corresponding increase in throughput at a TCP level (and that would certainly be at odds with the observed halt in outbound traffic), and it’s also not clear why that would prevent reads, though I’m less familiar with Netty’s internals on that point
  • This does not appear to be a problem with the cluster itself because traffic continues as normal to all other application servers
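
Since the keyspace-notification handoff comes up a few times above, here’s roughly what it looks like; the class name, pool size, and handler are illustrative rather than our actual code:

import io.lettuce.core.pubsub.RedisPubSubAdapter;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hands each keyspace notification off to an application-owned thread pool so the
// Netty IO thread returns to the event loop immediately instead of running our handler.
public class DispatchingKeyspaceListener extends RedisPubSubAdapter<String, String> {

    private final ExecutorService notificationExecutor = Executors.newFixedThreadPool(4);

    @Override
    public void message(String channel, String message) {
        notificationExecutor.submit(() -> handleNotification(channel, message));
    }

    @Override
    public void message(String pattern, String channel, String message) {
        notificationExecutor.submit(() -> handleNotification(channel, message));
    }

    private void handleNotification(String channel, String message) {
        // Application-specific processing happens here, off the IO thread.
    }
}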

I expect to have a thread dump in the next few days. In the meantime, while I can’t share the raw debug data in its entirety, I’d be happy to answer questions about it and go digging for potentially relevant excerpts that I can share here.
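
For reference, the thread-dump capture I have in mind is nothing fancier than a ThreadMXBean dump triggered when a command runs suspiciously long; a rough sketch, with the threshold and output destination as placeholders:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Dumps all thread stacks when a Redis command exceeds a latency threshold, so we
// can see what the lettuce-nioEventLoop threads were doing mid-stall.
public class StallThreadDumper {

    private static final long STALL_THRESHOLD_MILLIS = 1_000;

    public static void dumpIfSlow(long commandDurationMillis) {
        if (commandDurationMillis < STALL_THRESHOLD_MILLIS) {
            return;
        }

        ThreadMXBean threads = ManagementFactory.getThreadMXBean();

        for (ThreadInfo threadInfo : threads.dumpAllThreads(true, true)) {
            // ThreadInfo#toString truncates deep stacks; a real implementation would
            // format the frames itself and write them to a dedicated log file.
            System.err.print(threadInfo);
        }
    }
}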

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 25 (25 by maintainers)

Top GitHub Comments

1 reaction
jchambers commented, Dec 22, 2020

I understand the decision to close this, but please accept my assurance that the investigation is both active and ongoing on our end. I’ll post an update as soon as I have one.

1 reaction
jchambers commented, Nov 20, 2020

I’m continuing to work with AWS on this issue, but don’t have anything to report from that thread yet.

In thinking through what could be happening, I tried to more closely align some things in the Lettuce logs with our packet captures and noticed something curious. Here’s another representative stall:

Time               Source          Destination    Length    Info
18:01:14.015652    Lettuce         Redis server    336      51434 → 6379 [PSH, ACK] Seq=86707 Ack=147196 Win=3591 Len=268 TSval=1646985261 TSecr=3120642697
18:01:14.029754    Lettuce         Redis server    336      51434 → 6379 [PSH, ACK] Seq=86975 Ack=147196 Win=3591 Len=268 TSval=1646985275 TSecr=3120642697
18:01:14.042778    Lettuce         Redis server    336      51434 → 6379 [PSH, ACK] Seq=87243 Ack=147196 Win=3591 Len=268 TSval=1646985288 TSecr=3120642697
18:01:14.056651    Lettuce         Redis server    336      [TCP Retransmission] 51434 → 6379 [PSH, ACK] Seq=87243 Ack=147196 Win=3591 Len=268 TSval=1646985302 TSecr=3120642697
18:01:14.099746    Lettuce         Redis server    336      51434 → 6379 [PSH, ACK] Seq=87511 Ack=147196 Win=3591 Len=268 TSval=1646985345 TSecr=3120642697
18:01:14.159235    Lettuce         Redis server    336      51434 → 6379 [PSH, ACK] Seq=87779 Ack=147196 Win=3591 Len=268 TSval=1646985405 TSecr=3120642697
18:01:14.169596    Lettuce         Redis server    333      51434 → 6379 [PSH, ACK] Seq=88047 Ack=147196 Win=3591 Len=265 TSval=1646985415 TSecr=3120642697
18:01:14.170204    Lettuce         Redis server    336      51434 → 6379 [PSH, ACK] Seq=88312 Ack=147196 Win=3591 Len=268 TSval=1646985416 TSecr=3120642697
18:01:14.188244    Lettuce         Redis server    336      51434 → 6379 [PSH, ACK] Seq=88580 Ack=147196 Win=3591 Len=268 TSval=1646985434 TSecr=3120642697
18:01:14.206560    Lettuce         Redis server    333      51434 → 6379 [PSH, ACK] Seq=88848 Ack=147196 Win=3591 Len=265 TSval=1646985452 TSecr=3120642697
18:01:14.226694    Lettuce         Redis server    336      51434 → 6379 [PSH, ACK] Seq=89113 Ack=147196 Win=3591 Len=268 TSval=1646985472 TSecr=3120642697
18:01:14.248654    Lettuce         Redis server    1408     [TCP Retransmission] 51434 → 6379 [PSH, ACK] Seq=86707 Ack=147196 Win=3591 Len=1340 TSval=1646985494 TSecr=3120642697
18:01:14.668652    Lettuce         Redis server    1408     [TCP Retransmission] 51434 → 6379 [PSH, ACK] Seq=86707 Ack=147196 Win=3591 Len=1340 TSval=1646985914 TSecr=3120642697
18:01:15.500657    Lettuce         Redis server    1408     [TCP Retransmission] 51434 → 6379 [PSH, ACK] Seq=86707 Ack=147196 Win=3591 Len=1340 TSval=1646986746 TSecr=3120642697
18:01:17.168646    Lettuce         Redis server    1408     [TCP Retransmission] 51434 → 6379 [PSH, ACK] Seq=86707 Ack=147196 Win=3591 Len=1340 TSval=1646988414 TSecr=3120642697
18:01:17.168816    Redis server    Lettuce         80       [TCP Previous segment not captured] 6379 → 51434 [ACK] Seq=151107 Ack=89381 Win=1543 Len=0 TSval=3120645861 TSecr=1646985452 SLE=86707 SRE=88047
18:01:17.168829    Lettuce         Redis server    1516     51434 → 6379 [ACK] Seq=89381 Ack=147196 Win=3591 Len=1448 TSval=1646988414 TSecr=3120642697
18:01:17.168832    Lettuce         Redis server    1516     51434 → 6379 [ACK] Seq=90829 Ack=147196 Win=3591 Len=1448 TSval=1646988414 TSecr=3120642697

I filtered the logs to see what the thread handling that connection was doing at the time:

DEBUG [2020-11-12 18:01:17,013] [lettuce-nioEventLoop-4-4] io.lettuce.core.protocol.RedisStateMachine: Decode done, empty stack: true
DEBUG [2020-11-12 18:01:17,013] [lettuce-nioEventLoop-4-4] io.lettuce.core.protocol.CommandHandler: [channel=0x9be15e7f, /10.0.2.99:59220 -> /10.0.2.111:6379, chid=0x54] Completing command LatencyMeteredCommand [type=EVALSHA, output=NestedMultiOutput [output=[], error='null'], commandType=io.lettuce.core.cluster.ClusterCommand]
DEBUG [2020-11-12 18:01:17,018] [lettuce-nioEventLoop-4-4] io.lettuce.core.protocol.CommandHandler: [channel=0x3719aead, /10.0.2.99:39148 -> /10.0.0.109:6379, chid=0x4c] write(ctx, ClusterCommand [command=AsyncCommand [type=GET, output=ValueOutput [output=null, error='null'], commandType=io.lettuce.core.protocol.Command], redirections=0, maxRedirections=5], promise)
DEBUG [2020-11-12 18:01:17,018] [lettuce-nioEventLoop-4-4] io.lettuce.core.protocol.CommandEncoder: [channel=0x3719aead, /10.0.2.99:39148 -> /10.0.0.109:6379] writing command ClusterCommand [command=AsyncCommand [type=GET, output=ValueOutput [output=null, error='null'], commandType=io.lettuce.core.protocol.Command], redirections=0, maxRedirections=5]
DEBUG [2020-11-12 18:01:17,018] [lettuce-nioEventLoop-4-4] io.lettuce.core.protocol.CommandHandler: [channel=0x3719aead, /10.0.2.99:39148 -> /10.0.0.109:6379, chid=0x4c] Received: 1865 bytes, 1 commands in the stack
DEBUG [2020-11-12 18:01:17,018] [lettuce-nioEventLoop-4-4] io.lettuce.core.protocol.CommandHandler: [channel=0x3719aead, /10.0.2.99:39148 -> /10.0.0.109:6379, chid=0x4c] Stack contains: 1 commands
DEBUG [2020-11-12 18:01:17,018] [lettuce-nioEventLoop-4-4] io.lettuce.core.protocol.RedisStateMachine: Decode done, empty stack: true
DEBUG [2020-11-12 18:01:17,018] [lettuce-nioEventLoop-4-4] io.lettuce.core.protocol.CommandHandler: [channel=0x3719aead, /10.0.2.99:39148 -> /10.0.0.109:6379, chid=0x4c] Completing command LatencyMeteredCommand [type=GET, output=ValueOutput [output=REDACTED, error='null'], commandType=io.lettuce.core.cluster.ClusterCommand]
DEBUG [2020-11-12 18:01:17,020] [lettuce-nioEventLoop-4-4] io.lettuce.core.protocol.CommandHandler: [channel=0x3719aead, /10.0.2.99:39148 -> /10.0.0.109:6379, chid=0x4c] write(ctx, ClusterCommand [command=AsyncCommand [type=GET, output=ValueOutput [output=null, error='null'], commandType=io.lettuce.core.protocol.Command], redirections=0, maxRedirections=5], promise)
DEBUG [2020-11-12 18:01:17,020] [lettuce-nioEventLoop-4-4] io.lettuce.core.protocol.CommandEncoder: [channel=0x3719aead, /10.0.2.99:39148 -> /10.0.0.109:6379] writing command ClusterCommand [command=AsyncCommand [type=GET, output=ValueOutput [output=null, error='null'], commandType=io.lettuce.core.protocol.Command], redirections=0, maxRedirections=5]
DEBUG [2020-11-12 18:01:17,753] [lettuce-nioEventLoop-4-4] io.lettuce.core.protocol.CommandHandler: [channel=0x3719aead, /10.0.2.99:39148 -> /10.0.0.109:6379, chid=0x4c] Received: 2533 bytes, 1 commands in the stack
DEBUG [2020-11-12 18:01:17,872] [lettuce-nioEventLoop-4-4] io.lettuce.core.protocol.CommandHandler: [channel=0x3719aead, /10.0.2.99:39148 -> /10.0.0.109:6379, chid=0x4c] Stack contains: 1 commands
DEBUG [2020-11-12 18:01:17,872] [lettuce-nioEventLoop-4-4] io.lettuce.core.protocol.RedisStateMachine: Decode done, empty stack: true

Even though none of those lines are related to the connection in question (on port 51434), there’s a suspicious stall between 18:01:17,020 and 18:01:17,753, and that’s right when traffic from the Redis host seems to wake back up abruptly. Checking our application logs, we see a batch of 46 timeouts starting at 18:01:17,017 and ending at 18:01:17,756.

I’ve attached a slightly longer excerpt from the Lettuce logs that shows the backlog getting cleared out. One thing that’s curious to me is that, even though we can see packets arriving during that pause in the logs, Lettuce doesn’t notice them for several hundred milliseconds, and then the packet logs and Lettuce logs seem to (temporarily?) lose agreement about what’s happening when. My wager is that the IO thread is getting tied up populating stack traces for the RedisCommandTimeoutException instances, then has to catch up on the newly-arrived packets.

One thing I think this does tell us is that we’re probably not overrunning the TCP receive buffer because we see packets arriving before Lettuce logs that it’s processed them (unless there’s some weird packet purgatory where something is taking packets from the buffer and holding them for a significant amount of time before Lettuce processes them). It still doesn’t explain why the stall happened in the first place, though.

There may also be new hints in the attached thread log that would jump out to an experienced reader but aren’t obvious to me.

Anyhow, I don’t think there are any earth-shattering revelations in here, and I’ll continue to work with the AWS folks. Still, I wanted to share updates as I have them.

Cheers!
