Periodic IO stalls/pauses
We’ve observed behavior in our production systems where all Lettuce operations will (seemingly without provocation) grind to a halt, then resume just as abruptly. We’ve directly measured “stalls” between 3 and 5 seconds, and can infer the existence of stalls shorter than 3s and longer than 10s from other metrics. Stalls appear to happen on the order of once every host-week (which can add up quickly when there are lots of hosts in play). This generally results in a bunch of RedisCommandTimeoutException errors and a pileup of pending requests.
We’ve observed this behavior both with Lettuce 5.3.4 and 6.0.0 under Ubuntu 18.04. We have not yet tried 5.3.5 or 6.0.1, but (unless this is somehow a TCP_NODELAY issue) I haven’t spotted anything in the changelog that seems relevant.
Our setup is that we have tens of application servers, each of which has a default ClientResources instance shared between three RedisClusterClient instances. We have a default command timeout of 3 seconds. Our application servers have 8 physical cores, and so my understanding is that our IO thread pools have 8 threads. Our application servers live in AWS EC2 and communicate with three Redis clusters hosted by AWS ElastiCache:
- A general cache cluster with two shards (each with a leader and replica) that mostly receives simple GET/SET commands of relatively consistent sizes
- A metrics cluster with two shards (each with a leader and replica) that also mostly receives simple GET/SET commands
- A notification cluster with 8 shards (one leader/replica per shard) that receives a relatively high volume of messages of varying sizes and also notifies listeners of new messages via keyspace notifications
Between all three clusters, application instances issue somewhere between 5,000 and 10,000 Redis commands per second.
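For concreteness, the client setup looks roughly like this (host names and the exact option wiring below are illustrative placeholders rather than our actual code):

```java
import java.time.Duration;

import io.lettuce.core.RedisURI;
import io.lettuce.core.TimeoutOptions;
import io.lettuce.core.cluster.ClusterClientOptions;
import io.lettuce.core.cluster.RedisClusterClient;
import io.lettuce.core.resource.ClientResources;
import io.lettuce.core.resource.DefaultClientResources;

public class RedisClients {
    public static void main(String[] args) {
        // One shared ClientResources instance; its IO thread pool defaults to the number
        // of available processors, i.e. 8 threads on our application servers.
        ClientResources resources = DefaultClientResources.create();

        // Three RedisClusterClient instances share the same resources. The host name here
        // is a placeholder; the metrics and notification clients are built the same way.
        RedisClusterClient cacheClient = RedisClusterClient.create(
                resources, RedisURI.create("redis://cache.example.internal:6379"));
        cacheClient.setOptions(ClusterClientOptions.builder()
                // 3-second default command timeout
                .timeoutOptions(TimeoutOptions.enabled(Duration.ofSeconds(3)))
                .build());
    }
}
```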
We’ve instrumented instances by logging from io.lettuce.core.protocol at DEBUG and by observing TCP traffic (a sketch of the logging setup follows the symptom list below). When we experience a stall, the symptoms are:
- All traffic from the affected instance to ALL shards in ALL Redis clusters stops completely at a TCP level
- Other application servers are unaffected and continue communicating with Redis nodes normally
- Lettuce continues to log attempts to write commands, though the write attempts don’t resolve for several seconds and we see no corresponding outbound packets at a TCP level
- Keyspace notifications from the notification cluster continue to arrive at the affected instance as expected and are ACKed at the TCP level even though they’re not processed at the application layer; I’m afraid I can’t tell from the Lettuce logs if they’re “seen” by Lettuce when they arrive or if that’s also delayed
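For reference, the io.lettuce.core.protocol DEBUG logging mentioned above can be enabled in the logging configuration or programmatically; a minimal sketch, assuming Logback as the SLF4J backend:

```java
import ch.qos.logback.classic.Level;
import ch.qos.logback.classic.Logger;
import org.slf4j.LoggerFactory;

public class LettuceDebugLogging {
    public static void enable() {
        // Raise io.lettuce.core.protocol to DEBUG at runtime (Logback-specific cast).
        Logger protocolLogger = (Logger) LoggerFactory.getLogger("io.lettuce.core.protocol");
        protocolLogger.setLevel(Level.DEBUG);
    }
}
```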
A representative log line when attempting to write a command during a stall:
DEBUG [2020-10-30 19:19:48,265] io.lettuce.core.protocol.DefaultEndpoint: [channel=0x5154e1c6, /10.0.1.170:34672 -> /10.0.0.109:6379, epid=0x49] write() writeAndFlush command ClusterCommand [command=AsyncCommand [type=GET, output=ValueOutput [output=null, error='null'], commandType=io.lettuce.core.protocol.Command], redirections=0, maxRedirections=5]
…which will resolve several seconds later:
DEBUG [2020-10-30 19:19:50,184] io.lettuce.core.protocol.DefaultEndpoint: [channel=0x5154e1c6, /10.0.1.170:34672 -> /10.0.0.109:6379, epid=0x49] write() done
We do not see any reads reported by Lettuce during a stall, though we do see them before and after.
I have not yet found anything in our application logs, Lettuce’s logs, or TCP events that predicts when one of these stalls will occur. We will often see TCP retransmissions or “zero window” packets during or after a stall, but have not yet seen a consistent pattern prior to the stall. We have not observed any reconnections or cluster topology changes in our own logs, and to the best of my knowledge, that’s supported by Lettuce’s own logging because channel IDs remain consistent before and after a stall.
I do usually see an increase in allocated direct memory in the minute or two before one of these stalls, but we also see increases in allocated direct memory that do not precede a stall.
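For context, one way to track that direct-memory figure is to poll Netty’s own counter; the sketch below is illustrative rather than our exact instrumentation (PlatformDependent is internal Netty API and reports -1 when Netty isn’t tracking direct memory itself):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import io.netty.util.internal.PlatformDependent;

public class DirectMemoryProbe {
    public static void main(String[] args) {
        ScheduledExecutorService poller = Executors.newSingleThreadScheduledExecutor();
        // Log Netty's view of direct-memory usage every 10 seconds.
        poller.scheduleAtFixedRate(() -> {
            long used = PlatformDependent.usedDirectMemory();   // -1 if Netty isn't tracking it
            long max = PlatformDependent.maxDirectMemory();
            System.out.printf("netty direct memory: %d / %d bytes%n", used, max);
        }, 0, 10, TimeUnit.SECONDS);
    }
}
```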
I’m aware that similar issues have been reported in the past, but have generally been closed due to a lack of debug information. My next step is to capture some thread dumps when these stalls occur. In the meantime, I think we can address some hypotheses with reasonable (but not perfect) confidence:
- I would be surprised if something in our application space were blocking IO threads because:
  - We do not issue any blocking Redis commands
  - With the exception of keyspace notifications, we do not call any commands that result in a callback from Lettuce into our code
  - Keyspace notifications are immediately dispatched to a separate ExecutorService to avoid blocking Lettuce/Netty IO threads (see the handoff sketch after this list)
  - In keeping with our application’s underlying framework, we use sync() commands for virtually everything
  - It seems unlikely that we’d manage to block all IO threads simultaneously
- The duration of a stall is not closely correlated with our command timeout (i.e. a command timing out does not appear to unclog things)
- This does not appear to be correlated with GC activity; GC time is consistent before, during, and after a stall
- The direct memory allocation clue may point to us sending an abnormally large (several megabytes?) message through our notification cluster, but it’s not clear to me how that would block all other traffic. I also can’t find a corresponding increase in throughput at the TCP level (and such an increase would be at odds with the observed halt in outbound traffic), and it’s not clear why a large message would prevent reads, though I’m less familiar with Netty’s internals on that point
- This does not appear to be a problem with the cluster itself because traffic continues as normal to all other application servers
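For reference, the keyspace-notification handoff mentioned in the list above does nothing but submit to an executor on the Lettuce callback thread; a simplified sketch (handleNotification and the subscription pattern are placeholders, not our actual code):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import io.lettuce.core.cluster.RedisClusterClient;
import io.lettuce.core.cluster.pubsub.StatefulRedisClusterPubSubConnection;
import io.lettuce.core.pubsub.RedisPubSubAdapter;

public class NotificationSubscriber {
    private final ExecutorService notificationExecutor = Executors.newFixedThreadPool(4);

    public void subscribe(RedisClusterClient notificationClient) {
        StatefulRedisClusterPubSubConnection<String, String> pubSub =
                notificationClient.connectPubSub();
        pubSub.addListener(new RedisPubSubAdapter<String, String>() {
            @Override
            public void message(String pattern, String channel, String message) {
                // This runs on a Lettuce/Netty IO thread; hand off immediately and return.
                notificationExecutor.submit(() -> handleNotification(channel, message));
            }
        });
        pubSub.sync().psubscribe("__keyevent@0__:*");  // placeholder pattern
    }

    private void handleNotification(String channel, String message) {
        // Application-specific processing happens here, off the IO threads.
    }
}
```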
I expect to have a thread dump in the next few days. In the meantime, while I can’t provide the raw debug data in its entirety, I’d be happy to answer questions about it and go digging for potentially relevant excerpts that I can share here.
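One way to capture those dumps at exactly the right moment would be a small in-process watchdog that fires a cheap probe command and dumps the Lettuce thread stacks when the probe stalls. A rough sketch under assumptions (connection stands in for one of our existing cluster connections, and the thread-name prefix and thresholds are guesses):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import io.lettuce.core.cluster.api.StatefulRedisClusterConnection;

public class StallWatchdog {
    public static void watch(StatefulRedisClusterConnection<String, String> connection) {
        ScheduledExecutorService watchdog = Executors.newSingleThreadScheduledExecutor();
        watchdog.scheduleAtFixedRate(() -> {
            try {
                // Cheap probe; if it stalls for more than a second, assume we're in a pause.
                connection.async().ping().get(1, TimeUnit.SECONDS);
            } catch (Exception probeStalled) {
                // Dump stacks of the Lettuce/Netty event-loop threads for later analysis.
                for (ThreadInfo info : ManagementFactory.getThreadMXBean().dumpAllThreads(false, false)) {
                    if (info.getThreadName().startsWith("lettuce-")) {
                        System.err.println(info.getThreadName() + " (" + info.getThreadState() + ")");
                        for (StackTraceElement frame : info.getStackTrace()) {
                            System.err.println("    at " + frame);
                        }
                    }
                }
            }
        }, 5, 5, TimeUnit.SECONDS);
    }
}
```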
Top GitHub Comments
I understand the decision to close this, but please accept my assurance that the investigation is both active and ongoing on our end. I’ll post an update as soon as I have one.
I’m continuing to work with AWS on this issue, but don’t have anything to report from that thread yet.
In thinking through what could be happening, I tried to more closely align some things in the Lettuce logs with our packet captures and noticed something curious. Here’s another representative stall:
I filtered the logs to see what the thread handling that connection was doing at the time:
Even though none of those lines are related to the connection in question (on port 51434), there’s a suspicious stall between 18:01:17,020 and 18:01:17,753, and that’s right when traffic from the Redis host seems to wake back up abruptly. Checking our application logs, we see a batch of 46 timeouts starting at 18:01:17,017 and ending at 18:01:17,756.
I’ve attached a slightly longer excerpt from the Lettuce logs that shows the backlog getting cleared out. One thing that’s curious to me is that, even though we can see packets arriving during that pause in the logs, Lettuce doesn’t notice them for several hundred milliseconds, and then the packet logs and Lettuce logs seem to (temporarily?) lose agreement about what’s happening when. My wager is that the IO thread is getting tied up populating stack traces for the RedisCommandTimeoutException instances, then has to catch up on the newly-arrived packets.

One thing I think this does tell us is that we’re probably not overrunning the TCP receive buffer, because we see packets arriving before Lettuce logs that it’s processed them (unless there’s some weird packet purgatory where something takes packets from the buffer and holds them for a significant amount of time before Lettuce processes them). It still doesn’t explain why the stall happened in the first place, though.
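To sanity-check the stack-trace theory, a throwaway micro-benchmark along these lines would give a rough lower bound on the cost of constructing a burst of exceptions (pure JDK, not a measurement from our systems; real call stacks are much deeper than this example’s, so the true cost should be higher):

```java
public class ExceptionCostProbe {
    public static void main(String[] args) {
        // Warm up so JIT compilation doesn't dominate the measurement.
        for (int i = 0; i < 10_000; i++) {
            new RuntimeException("warmup");
        }
        // Roughly the size of the timeout burst we observed (46 exceptions).
        long start = System.nanoTime();
        for (int i = 0; i < 46; i++) {
            new RuntimeException("Command timed out after 3 second(s)");
        }
        long elapsedMicros = (System.nanoTime() - start) / 1_000;
        System.out.println("46 exception constructions took " + elapsedMicros + " µs");
    }
}
```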
There may also be some new hints in the attached thread log that would jump out to an experienced reader but aren’t obvious to me.
Anyhow, I don’t think there are any earth-shattering revelations in here, and I’ll continue to work with the AWS folks. Still, I wanted to share updates as I have them.
Cheers!