
HTTP/2 stack creates excessive latency and throughput overhead

See original GitHub issue

Issue Type: Bug report

Linkerd introduces significant throughput and latency overhead for gRPC services compared to two processes communicating directly.

Here is a max qps strest-grpc run for each scenario:

Direct

> ./strest-grpc server --address "127.0.0.1:9999"
> ./strest-grpc client --address "127.0.0.1:9999" --totalRequests 100000 --streams 100 
{
  "good": 100000,
  "bad": 0,
  "bytes": 0,
  "latency": {
    "p50": 3,
    "p75": 4,
    "p90": 5,
    "p95": 5,
    "p99": 8,
    "p999": 19
  },
  "jitter": {
    "p50": 0,
    "p75": 0,
    "p90": 0,
    "p95": 0,
    "p99": 0,
    "p999": 0
  }
}

Via Linkerd

> ./strest-grpc client --address "127.0.0.1:4143" --totalRequests 100000 --streams 100 
{
  "good": 100000,
  "bad": 0,
  "bytes": 0,
  "latency": {
    "p50": 46,
    "p75": 53,
    "p90": 76,
    "p95": 122,
    "p99": 229,
    "p999": 412
  },
  "jitter": {
    "p50": 0,
    "p75": 0,
    "p90": 0,
    "p95": 0,
    "p99": 0,
    "p999": 0
  }
}

Configuration:

admin:
  port: 9990
  ip: 0.0.0.0

routers: 
  - label: h2-in
    protocol: h2
    experimental: true
    client:
      initialStreamWindowBytes: 1048576
      failureAccrual:
        kind: none
    servers:
      - port: 4143
        ip: 0.0.0.0
        maxConcurrentStreamsPerConnection: 2147483647
        initialStreamWindowBytes: 1048576
    dtab: |
       /svc/* => /$/inet/127.0.0.1/9999;
    identifier:
      kind: io.l5d.header.path
      segments: 1

In addition to using strest-grpc, I also wrote a custom but crude benchmarking tool of my own, echobench, so that I could experiment with gRPC channel settings and socket options.
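To give a sense of the knobs involved, the settings echobench experiments with look roughly like the following grpc-java sketch. This is illustrative only; the builder values and option choices here are assumptions, not echobench's actual code:

import io.grpc.ManagedChannel;
import io.grpc.netty.NettyChannelBuilder;
import io.netty.channel.ChannelOption;

public class ChannelTuning {
    // Illustrative sketch of the gRPC channel settings and socket options to experiment with.
    static ManagedChannel buildChannel(String host, int port) {
        return NettyChannelBuilder.forAddress(host, port)
                .usePlaintext()                              // benchmark traffic runs over plaintext h2c
                .flowControlWindow(1048576)                  // widen the HTTP/2 flow-control window
                .withOption(ChannelOption.TCP_NODELAY, true) // socket option to toggle per experiment
                .build();
    }
}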

The custom tool also reports a significant amount of overhead when using linkerd:

Direct

=== summary ===
threads: 10 (5000 reqs per thread)
requests: 50000
throughput: 2922.804486159504/s
errors: 0
latency:
- min: 1ms
- max: 17ms
- median: 2.0ms
- avg: 2.8737975505112257ms
- p95: 7.0ms
- p99: 11.0ms

Via Linkerd

=== summary ===
threads: 10 (5000 reqs per thread)
requests: 50000
throughput: 629.5124804747321/s
errors: 0
latency:
- min: 6ms
- max: 34ms
- median: 13.0ms
- avg: 13.981156425923315ms
- p95: 20.0ms
- p99: 27.0ms

Checking the metrics.json endpoint after a test run shows an interesting discrepancy between request and stream latencies. Stream latencies appear to be much higher:

Request latencies

  "rt/h2-in/client/$/inet/127.0.0.1/9999/service/svc/echo.EchoService/request_latency_ms.count": 53932,
  "rt/h2-in/client/$/inet/127.0.0.1/9999/service/svc/echo.EchoService/request_latency_ms.max": 54,
  "rt/h2-in/client/$/inet/127.0.0.1/9999/service/svc/echo.EchoService/request_latency_ms.min": 0,
  "rt/h2-in/client/$/inet/127.0.0.1/9999/service/svc/echo.EchoService/request_latency_ms.p50": 6,
  "rt/h2-in/client/$/inet/127.0.0.1/9999/service/svc/echo.EchoService/request_latency_ms.p90": 16,
  "rt/h2-in/client/$/inet/127.0.0.1/9999/service/svc/echo.EchoService/request_latency_ms.p95": 22,
  "rt/h2-in/client/$/inet/127.0.0.1/9999/service/svc/echo.EchoService/request_latency_ms.p99": 31,
  "rt/h2-in/client/$/inet/127.0.0.1/9999/service/svc/echo.EchoService/request_latency_ms.p9990": 41,
  "rt/h2-in/client/$/inet/127.0.0.1/9999/service/svc/echo.EchoService/request_latency_ms.p9999": 50,
  "rt/h2-in/client/$/inet/127.0.0.1/9999/service/svc/echo.EchoService/request_latency_ms.sum": 390436,
  "rt/h2-in/client/$/inet/127.0.0.1/9999/service/svc/echo.EchoService/request_latency_ms.avg": 7.239412593636431,

Stream latencies

  "rt/h2-in/client/$/inet/127.0.0.1/9999/service/svc/echo.EchoService/response/stream/stream_duration_ms.count": 53945,
  "rt/h2-in/client/$/inet/127.0.0.1/9999/service/svc/echo.EchoService/response/stream/stream_duration_ms.max": 139,
  "rt/h2-in/client/$/inet/127.0.0.1/9999/service/svc/echo.EchoService/response/stream/stream_duration_ms.min": 10,
  "rt/h2-in/client/$/inet/127.0.0.1/9999/service/svc/echo.EchoService/response/stream/stream_duration_ms.p50": 46,
  "rt/h2-in/client/$/inet/127.0.0.1/9999/service/svc/echo.EchoService/response/stream/stream_duration_ms.p90": 59,
  "rt/h2-in/client/$/inet/127.0.0.1/9999/service/svc/echo.EchoService/response/stream/stream_duration_ms.p95": 63,
  "rt/h2-in/client/$/inet/127.0.0.1/9999/service/svc/echo.EchoService/response/stream/stream_duration_ms.p99": 79,
  "rt/h2-in/client/$/inet/127.0.0.1/9999/service/svc/echo.EchoService/response/stream/stream_duration_ms.p9990": 104,
  "rt/h2-in/client/$/inet/127.0.0.1/9999/service/svc/echo.EchoService/response/stream/stream_duration_ms.p9999": 135,
  "rt/h2-in/client/$/inet/127.0.0.1/9999/service/svc/echo.EchoService/response/stream/stream_duration_ms.sum": 2538980,
  "rt/h2-in/client/$/inet/127.0.0.1/9999/service/svc/echo.EchoService/response/stream/stream_duration_ms.avg": 47.06608582815831,

Another interesting peculiarity is that only one stream is ever shown as open back to the test service, despite several threads sending requests simultaneously:

"rt/h2-in/client/$/inet/127.0.0.1/9999/stream/open_streams": 1

After discovering the overhead I attempted several configuration and code changes to alleviate this issue, including:

  • Manually setting max concurrent stream limits to their maximum values on both ends of the connection (see the sketch after this list)
  • Manually expanding the HTTP/2 flow-control window sizes to their maximum allowable values on both ends of the connection
  • Setting the retry buffer sizes to 0
  • Removing the ClassifiedRetries module from the Finagle client stack entirely in code
  • Manually expanding the maximum frame size to its maximum allowable value
  • Removing the DelayedReleaseService from the Finagle client stack in code

None of these changes made a noticeable impact on the latency and throughput.
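For reference, the first two tweaks look roughly like this on the grpc-java side of the test setup. This is a hedged sketch, not the exact code used; the linkerd-side equivalents are the maxConcurrentStreamsPerConnection and initialStreamWindowBytes settings shown in the config above:

import io.grpc.BindableService;
import io.grpc.Server;
import io.grpc.netty.NettyServerBuilder;

public class ServerTuning {
    // Illustrative only: raise the HTTP/2 limits to their maximums on the gRPC server end.
    static Server buildServer(int port, BindableService service) {
        return NettyServerBuilder.forPort(port)
                .maxConcurrentCallsPerConnection(Integer.MAX_VALUE) // mirrors maxConcurrentStreamsPerConnection
                .flowControlWindow(1048576)                         // mirrors initialStreamWindowBytes
                .addService(service)
                .build();
    }
}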

I took several tcpdumps to examine the behavior of the service communication under both scenarios. A big difference in the number of captured packets immediately jumps out: 2966 for direct versus 22542 via linkerd.

Looking at the tcpdumps in Wireshark confirms why: when communicating directly the client is able to multiplex HTTP/2 frames from ~120 streams into a single TCP packet. Linkerd, by comparison, creates a TCP packet for each HTTP/2 stream frame.

Linkerd packets

[Wireshark screenshot of the linkerd capture]

Direct packets

[Wireshark screenshot of the direct capture]

The discrepancy in latency and throughput could be attributed to the additional syscall and packetization overhead required to forward HTTP/2 traffic in this way.

I found the discrepancy surprising given that both linkerd and grpc-java are built on the netty4 HTTP/2 stack primitives. It turns out that while the grpc-java stack was being developed, its authors identified a problem with netty4’s H2 stack flushing too frequently.

The fix on their side was a write queue that batches writes before flushing (the relevant links are in the original issue).

The benchmarks listed in the write queue PR indicate this change unlocked a significant jump in throughput, especially in the case of many streams (16806.718 ops/s versus 55975.008 ops/s for 1000 streams).

I don’t have an easy way to confirm how much a write queue will reduce the performance overhead I’m observing. I’m open to suggestions for a quick and dirty way to get more confidence in this as a potential solution.
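One quick-and-dirty experiment might be to drop Netty’s stock FlushConsolidationHandler into the proxy’s client pipeline. It isn’t grpc-java’s write queue, but it approximates the same “batch writes, flush less often” behavior, and comparing tcpdumps with and without it should show whether frames start coalescing into fewer packets. A minimal sketch (the actual pipeline wiring inside linkerd will of course differ):

import io.netty.channel.ChannelInitializer;
import io.netty.channel.socket.SocketChannel;
import io.netty.handler.flush.FlushConsolidationHandler;

public class FlushCoalescingInitializer extends ChannelInitializer<SocketChannel> {
    @Override
    protected void initChannel(SocketChannel ch) {
        // Consolidate up to 256 flushes into one, even while no read is in progress,
        // so many small HTTP/2 frames can share a single write/flush (and TCP packet).
        ch.pipeline().addFirst(new FlushConsolidationHandler(256, true));
        // ... the rest of the HTTP/2 pipeline (codec, frame handlers) would follow here.
    }
}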

I did try enabling Nagle’s algorithm (removing the TCP_NODELAY option) on the client socket in linkerd, and after a few subsequent test runs I was able to get throughput to increase in a single-threaded use case.
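For anyone reproducing that experiment, it amounts to flipping a single Netty socket option on the outbound channel; a minimal sketch, not linkerd’s actual configuration code:

import io.netty.bootstrap.Bootstrap;
import io.netty.channel.ChannelOption;

public class NagleToggle {
    // Illustrative only: with TCP_NODELAY disabled, Nagle's algorithm coalesces small
    // writes into fewer packets, trading a little per-write latency for fewer syscalls.
    static Bootstrap withNagle(Bootstrap bootstrap) {
        return bootstrap.option(ChannelOption.TCP_NODELAY, false);
    }
}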

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Reactions: 7
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

2 reactions
wmorgan commented, Sep 19, 2018

Wow. Thank you for the excellent writeup!

0 reactions
evhfla-zz commented, Sep 26, 2018

Thanks for the quick reply @zackangelo …hoping that Alex can have a fix soon.


