Leaking connections on 500 retries
Posting this as a follow-up to a Slack conversation with Alex. This is against linkerd 0.9.0.
We are noticing that linkerd's connection pool for certain endpoints grows rapidly during error conditions and then never shrinks, whereas when the same app gets a burst of normal traffic, the pool drains back to 0 fairly quickly.
The timeline is:
- Request comes in. We send that request to linkerd.
- Linkerd returns an error to us after 20 seconds (our request timeout)
- Connection pool has ~83 idle connections that do not disappear
- metrics.json showed 83 responses with a 500 status and 83 ResponseClassificationSyntheticExceptions
- `netstat -atpn | grep <L5d.pid> | grep CLOSE_WAIT` shows 83 sockets in CLOSE_WAIT (see the snippet after this list)
- It happened on Friday, and the connections are still there the following Monday
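A minimal way to line those two numbers up, assuming the standard Finagle per-client connections gauge and the admin port 9990 from the config below (stat names can vary by version):

```sh
# What linkerd thinks it has open: per-client "connections" gauges in
# metrics.json, served by the admin interface (port 9990 in the config below).
curl -s http://localhost:9990/admin/metrics.json | tr ',' '\n' | grep '/connections'

# What the kernel sees for the same process: sockets stuck in CLOSE_WAIT.
netstat -atpn 2>/dev/null | grep "<L5d.pid>" | grep -c CLOSE_WAIT
```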
The service recovered and is still working now. The connection pool count in the metrics is accurate: linkerd really is holding onto CLOSE_WAIT sockets, and this will eventually become an issue requiring a restart.
This happened twice, so the numbers in the attached metrics.json are a bit different. It also occurred on 3 separate linkerd instances at about the same time: the cause was the way the downstream app responded, but it affected 3 independent linkerds. Alex identified this in Slack:
I think you got stuck in a retry loop.
"rt/cgp/dst/path/svc/live/appName/retries/total": 274,
Something about that one request triggered a 5xx response from the server, and it kept being retried and failing until either the retry budget was exhausted or the request timed out. It looks like there were 99 requests, 97 of which were successful and 2 of which were failures, and those failures triggered 274 retries.
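Those counters come straight out of the admin endpoint; a quick way to pull everything retry-related (again assuming admin port 9990 from the config below) is:

```sh
# Dump every retry counter for the router; the rt/cgp/.../retries/total stat
# quoted above appears here alongside any per-client requeue counters.
curl -s http://localhost:9990/admin/metrics.json | tr ',' '\n' | grep 'retries'
```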
We have retryableRead5XX turned on, which is likely the reason for the large number of connections. This is a server in our staging environment, so it gets very little traffic. In the case of the error, a single request came in that caused 83 abandoned connections.
The attached metrics.json has the raw numbers; compare the connection count in the metrics with the CLOSE_WAIT sockets:
root@ip-10-122-17-136:~# netstat -atpn | grep 3887 |grep CLOSE_WAIT |wc
166 1162 16102
All of these CLOSE_WAIT sockets point at the server that errored out:
root@ip-10-122-17-136:~# netstat -atpn | grep 3887 |grep CLOSE_WAIT | grep 10.122.3.77:31392|wc
166 1162 16102
And this server is still serving traffic on that ip:port. We just had some sort of disruption for this brief window.
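A quick way to double-check that is to group the CLOSE_WAIT sockets by remote endpoint (pid 3887 as in the session above):

```sh
# Column 5 of `netstat -atpn` is the remote address; counting by it shows
# every leaked socket points at 10.122.3.77:31392.
netstat -atpn 2>/dev/null | grep 3887 | grep CLOSE_WAIT | \
  awk '{print $5}' | sort | uniq -c | sort -rn
```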
I have the metrics captured by Datadog, so we can see how they changed in 30-second windows. The attached graphs show the connection count, the count of 500 errors, and the bytes_sent stat for the application.
Debugging attempts
We attempted to see if using the connections would result in them being cleaned up. I flooded a server that had 159 connections and 159 sockets in CLOSE_WAIT. The pool grew to 360+, then returned to 159, so exercising the pool did not release the leaked sockets.
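Roughly what that flood looked like, as a sketch (the X-Route-Id value and the /health path below are placeholders, not the real route):

```sh
# Push a burst of requests through linkerd's server port (4140 per the config)
# so the client pool has to open, and then idle out, fresh connections...
for i in $(seq 1 500); do
  curl -s -o /dev/null -H 'X-Route-Id: appName' http://localhost:4140/health &
done
wait

# ...then re-count the leaked sockets; in our test this settled back at 159.
netstat -atn | grep '10.122.3.77:31392' | grep -c CLOSE_WAIT
```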
Dead sockets
This is a sample of netstat showing the dead sockets that match the leaked connection counts
tcp 1 0 10.122.8.59:56734 10.122.3.77:31392 CLOSE_WAIT 4228/java
tcp 1 0 10.122.8.59:35906 10.122.3.77:31392 CLOSE_WAIT 4228/java
Linkerd log
. Remote Address: Inet(/10.122.3.77:31392,Map())
2017/03/31 18:17:16 I 0331 18:17:16.862 UTC THREAD24 TraceId:4f2d95dc50b33383: FailureAccrualFactory marking connection to "#/consul/us-east-1-vpc-XXXX/live/appName" as dead. Remote Address: Inet(/10.122.3.77:31392,Map())
2017/03/31 18:17:26 I 0331 18:17:26.970 UTC THREAD23 TraceId:4f2d95dc50b33383: FailureAccrualFactory marking connection to "#/consul/us-east-1-vpc-XXXX/live/appName" as dead. Remote Address: Inet(/10.122.3.77:31392,Map())
2017/03/31 18:17:30 E 0331 18:17:30.874 UTC THREAD10: service failure
2017/03/31 18:17:30 com.twitter.finagle.IndividualRequestTimeoutException: exceeded 20.seconds to 0.0.0.0/4140 while waiting for a response for an individual request, excluding retries. Remote Info: Not Available
2017/03/31 18:17:30
2017/03/31 18:17:31 E 0331 18:17:31.060 UTC THREAD24 TraceId:4f2d95dc50b33383: service failure
2017/03/31 18:17:31 Failure(20.seconds, flags=0x03) with RemoteInfo -> Upstream Address: /127.0.0.1:54910, Upstream Client Id: Not Available, Downstream Address: /10.122.3.77:31392, Downstream Client Id: #/consul/us-east-1-vpc-XXXX/live/appName, Trace Id: 4f2d95dc50b33383.4f2d95dc50b33383<:4f2d95dc50b33383
2017/03/31 18:17:31 Caused by: com.twitter.util.TimeoutException: 20.seconds
2017/03/31 18:17:31 at com.twitter.util.Future$$anonfun$within$1.apply(Future.scala:992)
Config file
```yaml
admin:
  port: 9990

routers:
- protocol: http
  httpAccessLog: /opt/proxy/logs/access.log
  label: cgp
  timeoutMs: 2000
  identifier:
  - kind: io.l5d.header
    header: X-Route-Id
  interpreter:
    kind: io.l5d.namerd
    dst: /$/inet/namerd/4100
    namespace: header
  responseClassifier:
    kind: io.l5d.retryableRead5XX
  client:
    loadBalancer:
      kind: ewma
    failureAccrual:
      kind: io.l5d.successRate
      successRate: 0.9
      requests: 20
      backoff:
        kind: constant
        ms: 10000
  servers:
  - port: 4140
    ip: 0.0.0.0
```
Top GitHub Comments
After some investigation, this appears to be related to retries when the responses are chunk-encoded. My current theory is that when a chunk-encoded response is retried, linkerd never actually reads the chunks of the discarded response, and since the response has unread chunks, the connection is never released.
I will attempt to test this theory and put together a fix. I’ll update this issue as I know more.
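One way to exercise that code path, assuming a test route whose backend answers with a chunked 5xx (the /fail path and header value here are hypothetical):

```sh
# A single request classified as retryable (5xx) and answered with a chunked
# body; if the theory holds, each discarded retry response leaves a socket behind.
curl -s -o /dev/null -H 'X-Route-Id: appName' http://localhost:4140/fail

# Then check whether CLOSE_WAIT sockets to the backend have piled up.
netstat -atn | grep '<backend-ip:port>' | grep -c CLOSE_WAIT
```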
I’ve closed this as I believe that https://github.com/linkerd/linkerd/issues/1256 fixes the issue. Please reopen this if you see this behavior again.