HTTP/2 stack creates excessive latency and throughput overhead
Issue Type: Bug report
Linkerd introduces significant throughput and latency overhead for gRPC services compared to the same two processes communicating directly.
Here is a max qps strest-grpc run for each scenario:
Direct
> ./strest-grpc server --address "127.0.0.1:9999"
> ./strest-grpc client --address "127.0.0.1:9999" --totalRequests 100000 --streams 100
{
  "good": 100000,
  "bad": 0,
  "bytes": 0,
  "latency": {
    "p50": 3,
    "p75": 4,
    "p90": 5,
    "p95": 5,
    "p99": 8,
    "p999": 19
  },
  "jitter": {
    "p50": 0,
    "p75": 0,
    "p90": 0,
    "p95": 0,
    "p99": 0,
    "p999": 0
  }
}
Via Linkerd
> ./strest-grpc client --address "127.0.0.1:4143" --totalRequests 100000 --streams 100
{
  "good": 100000,
  "bad": 0,
  "bytes": 0,
  "latency": {
    "p50": 46,
    "p75": 53,
    "p90": 76,
    "p95": 122,
    "p99": 229,
    "p999": 412
  },
  "jitter": {
    "p50": 0,
    "p75": 0,
    "p90": 0,
    "p95": 0,
    "p99": 0,
    "p999": 0
  }
}
Configuration:
admin:
  port: 9990
  ip: 0.0.0.0

routers:
- label: h2-in
  protocol: h2
  experimental: true
  client:
    initialStreamWindowBytes: 1048576
    failureAccrual:
      kind: none
  servers:
  - port: 4143
    ip: 0.0.0.0
    maxConcurrentStreamsPerConnection: 2147483647
    initialStreamWindowBytes: 1048576
  dtab: |
    /svc/* => /$/inet/127.0.0.1/9999;
  identifier:
    kind: io.l5d.header.path
    segments: 1
In addition to using strest-grpc, I also wrote a crude benchmarking tool of my own, echobench, so that I could experiment with gRPC channel settings and socket options.
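echobench itself isn't reproduced here, but the kind of grpc-java channel tuning it was used to experiment with looks roughly like the following sketch (Scala against the grpc-netty builder API; the ports and values shown are illustrative, not the exact settings used):

import io.grpc.ManagedChannel
import io.grpc.netty.NettyChannelBuilder
import io.netty.channel.ChannelOption

// Illustrative only: build a plaintext channel and experiment with the HTTP/2
// flow-control window and socket options.
def buildChannel(host: String, port: Int): ManagedChannel =
  NettyChannelBuilder
    .forAddress(host, port)
    .usePlaintext()
    .flowControlWindow(1048576) // mirror linkerd's initialStreamWindowBytes
    .withOption[java.lang.Boolean](ChannelOption.TCP_NODELAY, true)
    .build()

// e.g. buildChannel("127.0.0.1", 4143) to go via linkerd,
//      buildChannel("127.0.0.1", 9999) to hit the echo service directly.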
The custom tool also reports a significant amount of overhead when using linkerd:
Direct
=== summary ===
threads: 10 (5000 reqs per thread)
requests: 50000
throughput: 2922.804486159504/s
errors: 0
latency:
- min: 1ms
- max: 17ms
- median: 2.0ms
- avg: 2.8737975505112257ms
- p95: 7.0ms
- p99: 11.0ms
Via Linkerd
=== summary ===
threads: 10 (5000 reqs per thread)
requests: 50000
throughput: 629.5124804747321/s
errors: 0
latency:
- min: 6ms
- max: 34ms
- median: 13.0ms
- avg: 13.981156425923315ms
- p95: 20.0ms
- p99: 27.0ms
Checking the metrics.json endpoint after a test run shows an interesting discrepancy between request and stream latencies. Stream latencies appear to be much higher:
Request latencies
"rt/h2-in/client/$/inet/127.0.0.1/9999/service/svc/echo.EchoService/request_latency_ms.count": 53932,
"rt/h2-in/client/$/inet/127.0.0.1/9999/service/svc/echo.EchoService/request_latency_ms.max": 54,
"rt/h2-in/client/$/inet/127.0.0.1/9999/service/svc/echo.EchoService/request_latency_ms.min": 0,
"rt/h2-in/client/$/inet/127.0.0.1/9999/service/svc/echo.EchoService/request_latency_ms.p50": 6,
"rt/h2-in/client/$/inet/127.0.0.1/9999/service/svc/echo.EchoService/request_latency_ms.p90": 16,
"rt/h2-in/client/$/inet/127.0.0.1/9999/service/svc/echo.EchoService/request_latency_ms.p95": 22,
"rt/h2-in/client/$/inet/127.0.0.1/9999/service/svc/echo.EchoService/request_latency_ms.p99": 31,
"rt/h2-in/client/$/inet/127.0.0.1/9999/service/svc/echo.EchoService/request_latency_ms.p9990": 41,
"rt/h2-in/client/$/inet/127.0.0.1/9999/service/svc/echo.EchoService/request_latency_ms.p9999": 50,
"rt/h2-in/client/$/inet/127.0.0.1/9999/service/svc/echo.EchoService/request_latency_ms.sum": 390436,
"rt/h2-in/client/$/inet/127.0.0.1/9999/service/svc/echo.EchoService/request_latency_ms.avg": 7.239412593636431,
Stream latencies
"rt/h2-in/client/$/inet/127.0.0.1/9999/service/svc/echo.EchoService/response/stream/stream_duration_ms.count": 53945,
"rt/h2-in/client/$/inet/127.0.0.1/9999/service/svc/echo.EchoService/response/stream/stream_duration_ms.max": 139,
"rt/h2-in/client/$/inet/127.0.0.1/9999/service/svc/echo.EchoService/response/stream/stream_duration_ms.min": 10,
"rt/h2-in/client/$/inet/127.0.0.1/9999/service/svc/echo.EchoService/response/stream/stream_duration_ms.p50": 46,
"rt/h2-in/client/$/inet/127.0.0.1/9999/service/svc/echo.EchoService/response/stream/stream_duration_ms.p90": 59,
"rt/h2-in/client/$/inet/127.0.0.1/9999/service/svc/echo.EchoService/response/stream/stream_duration_ms.p95": 63,
"rt/h2-in/client/$/inet/127.0.0.1/9999/service/svc/echo.EchoService/response/stream/stream_duration_ms.p99": 79,
"rt/h2-in/client/$/inet/127.0.0.1/9999/service/svc/echo.EchoService/response/stream/stream_duration_ms.p9990": 104,
"rt/h2-in/client/$/inet/127.0.0.1/9999/service/svc/echo.EchoService/response/stream/stream_duration_ms.p9999": 135,
"rt/h2-in/client/$/inet/127.0.0.1/9999/service/svc/echo.EchoService/response/stream/stream_duration_ms.sum": 2538980,
"rt/h2-in/client/$/inet/127.0.0.1/9999/service/svc/echo.EchoService/response/stream/stream_duration_ms.avg": 47.06608582815831,
Another interesting peculiarity is that only one stream is ever reported as open back to the test service, despite several threads sending requests simultaneously:
"rt/h2-in/client/$/inet/127.0.0.1/9999/stream/open_streams": 1
After discovering the overhead, I attempted several configuration and code changes to alleviate it, including:
- Manually setting max concurrent stream limits to their maximum values on both ends of the connection
- Manually expanding the HTTP/2 flow control window sizes to the maximum allowable values on both ends of the connection
- Setting the retry buffer sizes to 0
- Removing the ClassifiedRetries module from the finagle client stack entirely in code
- Manually expanding the maximum frame size to the maximum allowable value
- Removing the DelayedReleaseService from the finagle client stack in code (a sketch of these stack changes follows this list)
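The stack changes amounted to something like the following sketch (Scala; the role names and the way linkerd's H2 client exposes its stack are assumptions here, so treat this as illustrative only):

import com.twitter.finagle.{ServiceFactory, Stack}

// Sketch only: drop modules from a finagle client stack by role. The role
// names below are assumptions and may not match how linkerd registers these
// modules.
def withoutRetriesOrDelayedRelease[Req, Rep](
  stack: Stack[ServiceFactory[Req, Rep]]
): Stack[ServiceFactory[Req, Rep]] =
  stack
    .remove(Stack.Role("ClassifiedRetries"))
    .remove(Stack.Role("DelayedRelease"))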
None of these changes made a noticeable impact on the latency and throughput.
I took several tcpdumps to examine the behavior of the service communication under both scenarios. A big difference in the number of captured packets immediately jumps out: 2966 for direct and 22542 for linkerd.
Looking at the tcpdumps in Wireshark shows why: when communicating directly, the client is able to multiplex HTTP/2 frames from ~120 streams into a single TCP packet. Linkerd, by comparison, creates a TCP packet for each HTTP/2 stream frame.
[Wireshark screenshots: Linkerd packets vs. direct packets]
The discrepancy in latency and throughput could be attributed to the additional syscall and packetization overhead required to forward HTTP/2 traffic in this way.
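In netty terms, the difference between the two packet traces looks roughly like the sketch below (Scala; illustrative only, not linkerd's actual write path):

import io.netty.channel.ChannelHandlerContext
import io.netty.handler.codec.http2.Http2DataFrame

// Illustrative only: flushing after every frame tends to produce one syscall
// (and often one TCP packet) per HTTP/2 frame, while writing a batch of frames
// and flushing once lets netty coalesce them into far fewer packets.
def flushPerFrame(ctx: ChannelHandlerContext, frames: Seq[Http2DataFrame]): Unit =
  frames.foreach(frame => ctx.writeAndFlush(frame)) // one flush per frame

def flushBatch(ctx: ChannelHandlerContext, frames: Seq[Http2DataFrame]): Unit = {
  frames.foreach(frame => ctx.write(frame)) // queue frames in the outbound buffer
  ctx.flush()                               // single flush for the whole batch
}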
I found the discrepancy surprising given that both linkerd and grpc-java are based on the netty4 HTTP/2 stack primitives. It turns out that when the grpc-java stack was being developed, its authors identified a problem with netty4’s H2 stack flushing too frequently.
They identified and fixed the issue in these places:
- HTTP2 remote flow controller flushing is too eager · Issue #3688 · netty/netty · GitHub
- HTTP/2 shouldn’t flush automatically · Issue #3670 · netty/netty · GitHub
- Disable flushing on frame write in flow-controller by louiscryan · Pull Request #3691 · netty/netty · GitHub
- Reduce number of flushes · Issue #305 · grpc/grpc-java · GitHub
- Implement writes to the channel using a dedicated write queue by louiscryan · Pull Request #431 · grpc/grpc-java · GitHub
The benchmarks listed in the write queue PR indicate the change unlocked a significant jump in throughput, especially with many streams (16806.718 ops/s before versus 55975.008 ops/s after, at 1000 streams).
I don’t have an easy way to confirm how much a write queue will reduce the performance overhead I’m observing. I’m open to suggestions for a quick and dirty way to get more confidence in this as a potential solution.
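One quick and dirty experiment might be to bolt netty's FlushConsolidationHandler onto the client pipeline to approximate the batching a dedicated write queue provides. This is only a sketch; whether it can be wired cleanly into linkerd's netty4 transport is an open question:

import io.netty.channel.ChannelPipeline
import io.netty.handler.flush.FlushConsolidationHandler

// Sketch only: consolidate up to 256 consecutive flushes into one, even while
// no read is in progress. This approximates (but is not the same as) grpc-java's
// dedicated write queue.
def addFlushConsolidation(pipeline: ChannelPipeline): ChannelPipeline =
  pipeline.addFirst("flushConsolidation", new FlushConsolidationHandler(256, true))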
I did try enabling Nagle’s algorithm (removing the TCP_NODELAY option) on the client socket in linkerd, and after a few subsequent test runs I was able to get throughput to increase in a single-threaded use case.
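For reference, toggling that option on a netty client bootstrap looks roughly like this (sketch only; how the option is actually plumbed through linkerd's transport is not shown):

import io.netty.bootstrap.Bootstrap
import io.netty.channel.ChannelOption

// Sketch only: setting TCP_NODELAY to false re-enables Nagle's algorithm, so
// the kernel may coalesce small writes into fewer packets at the cost of some
// added per-write latency.
def enableNagle(bootstrap: Bootstrap): Bootstrap =
  bootstrap.option[java.lang.Boolean](ChannelOption.TCP_NODELAY, false)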
Top GitHub Comments
Wow. Thank you for the excellent writeup!
Thanks for the quick reply @zackangelo …hoping that Alex can have a fix soon.