Leaking connection on 500 retries

Posting this as a follow-up to a Slack conversation with Alex. This is against linkerd 0.9.0.

We are noticing that linkerd’s connection pool for certain endpoints grows rapidly during error conditions, then doesn’t shrink. When the same app gets a burst of normal traffic, the pool shrinks back to 0 fairly quickly.

The timeline is:

  • A request comes in, and we send it to linkerd.
  • Linkerd returns an error to us after 20 seconds (our request timeout).
  • The connection pool has ~83 idle connections that do not disappear.
  • metrics.json showed 83 responses with status 500 and 83 ResponseClassificationSyntheticException errors.
  • netstat -atpn | grep <L5d.pid> | grep CLOSE_WAIT shows 83 sockets in CLOSE_WAIT.
  • This happened on a Friday, and the connections were still there the following Monday.

The service recovered and is still working now. The connection pool count is accurate: linkerd really is holding onto sockets in CLOSE_WAIT, and this will eventually become an issue requiring a restart.
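
A rough way to keep an eye on this condition is to compare the CLOSE_WAIT count for the linkerd process against the connection gauges in the admin endpoint's metrics.json (admin port 9990 per the config below). This is only a sketch: the pid matches the netstat output below, and the grep filter is approximate rather than an exact metric name.

# count CLOSE_WAIT sockets held by the linkerd process (pid 3887 in the netstat output below)
netstat -atpn 2>/dev/null | grep 3887 | grep -c CLOSE_WAIT
# compare against the connection gauges in metrics.json (flatten the JSON to one key per line)
curl -s localhost:9990/admin/metrics.json | tr ',' '\n' | grep -i connections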

This happened 2 times, so the numbers in the attached metrics.json are a bit different. It also occurred on 3 separate linkerd instances at about the same time; the cause was the way the downstream app responded, but it affected 3 independent linkerds. Alex identified this in Slack:

I think you got stuck in a retry loop. "rt/cgp/dst/path/svc/live/appName/retries/total": 274. Something about that one request triggered a 5xx response from the server, and it kept being retried and failing until either the retry budget was exhausted or the request timed out. It looks like there were 99 requests, 97 of which were successful and 2 of which were failures, and those failures triggered 274 retries.

We have retryableRead500 turned on, which is likely the reason for the large number of connections. This is a server in our staging environment, so it gets very little traffic. In the case of the error, a single request came in that caused 83 abandoned connections.
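
Since the retry counter quoted above comes straight from metrics.json, the amplification is easy to confirm on a live instance. A quick sketch, again assuming the admin port from the config below and an approximate grep pattern:

# pull the per-router retry counters from the admin endpoint
curl -s localhost:9990/admin/metrics.json | tr ',' '\n' | grep retries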

metrics.json (attached). Compare the number of connections in the metrics with the CLOSE_WAIT count:

root@ip-10-122-17-136:~# netstat -atpn | grep 3887 |grep CLOSE_WAIT |wc
    166    1162   16102

Showing that all of the CLOSE_WAIT sockets are to the server that errored out:

root@ip-10-122-17-136:~# netstat -atpn | grep 3887 |grep CLOSE_WAIT | grep 10.122.3.77:31392|wc
    166    1162   16102

And this server is still serving traffic on that ip:port; we just had some sort of disruption during this brief window.
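
A variation on the commands above confirms that every leaked socket points at that one backend without knowing its address up front, by grouping the CLOSE_WAIT entries by remote peer (column 5 of the netstat output):

# group leaked sockets by remote address for the linkerd pid
netstat -atpn 2>/dev/null | grep 3887 | awk '/CLOSE_WAIT/ {print $5}' | sort | uniq -c | sort -rn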

I have the metrics captured in Datadog, so we can see how they changed over a 30-second window. [Attached graphs: the connection count (connections), the count of 500 errors (500statuscounter), and the bytes_sent stat for the application (bytespersec).]

Debugging attempts

We attempted to see if using the connections would result in them being cleaned up. I flooded a server that had 159 connections and 159 sockets in CLOSE_WAIT; the pool grew to 360+, then returned to 159.
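
The flood itself was just ordinary traffic pushed through the router. A minimal sketch of that kind of test, assuming the router listens on 4140 and routes on the X-Route-Id header as in the config below (the header value and request path are illustrative):

# push a burst of requests through linkerd, then recheck the leaked-socket count
for i in $(seq 1 500); do
  curl -s -o /dev/null -H 'X-Route-Id: appName' http://localhost:4140/ &
done
wait
netstat -atpn 2>/dev/null | grep 3887 | grep -c CLOSE_WAIT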

Dead sockets

This is a sample of netstat output showing the dead sockets that match the leaked connection counts:

tcp        1      0 10.122.8.59:56734       10.122.3.77:31392       CLOSE_WAIT  4228/java       
tcp        1      0 10.122.8.59:35906       10.122.3.77:31392       CLOSE_WAIT  4228/java 

Linkerd log

. Remote Address: Inet(/10.122.3.77:31392,Map())
2017/03/31 18:17:16 I 0331 18:17:16.862 UTC THREAD24 TraceId:4f2d95dc50b33383: FailureAccrualFactory marking connection to "#/consul/us-east-1-vpc-XXXX/live/appName" as dead. Remote Address: Inet(/10.122.3.77:31392,Map())
2017/03/31 18:17:26 I 0331 18:17:26.970 UTC THREAD23 TraceId:4f2d95dc50b33383: FailureAccrualFactory marking connection to "#/consul/us-east-1-vpc-XXXX/live/appName" as dead. Remote Address: Inet(/10.122.3.77:31392,Map())
2017/03/31 18:17:30 E 0331 18:17:30.874 UTC THREAD10: service failure
2017/03/31 18:17:30 com.twitter.finagle.IndividualRequestTimeoutException: exceeded 20.seconds to 0.0.0.0/4140 while waiting for a response for an individual request, excluding retries. Remote Info: Not Available
2017/03/31 18:17:30
2017/03/31 18:17:30
2017/03/31 18:17:31 E 0331 18:17:31.060 UTC THREAD24 TraceId:4f2d95dc50b33383: service failure
2017/03/31 18:17:31 Failure(20.seconds, flags=0x03) with RemoteInfo -> Upstream Address: /127.0.0.1:54910, Upstream Client Id: Not Available, Downstream Address: /10.122.3.77:31392, Downstream Client Id: #/consul/us-east-1-vpc-XXXX/live/appName, Trace Id: 4f2d95dc50b33383.4f2d95dc50b33383<:4f2d95dc50b33383
2017/03/31 18:17:31 Caused by: com.twitter.util.TimeoutException: 20.seconds
2017/03/31 18:17:31     at com.twitter.util.Future$$anonfun$within$1.apply(Future.scala:992)
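
The FailureAccrualFactory lines show the client was also being marked dead while the retries were failing. Whether that is still happening can be read from the same metrics.json, though the exact failure-accrual and status stat names here are an assumption about this Finagle version:

# look for failure-accrual activity alongside the 5xx counters
curl -s localhost:9990/admin/metrics.json | tr ',' '\n' | grep -Ei 'failure_accrual|status/5'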

Config file

admin:
  port: 9990
routers:
- protocol: http
  httpAccessLog: /opt/proxy/logs/access.log
  label: cgp
  timeoutMs: 2000
  identifier:
  - kind: io.l5d.header
    header: X-Route-Id
  interpreter:
    kind: io.l5d.namerd
    dst: /$/inet/namerd/4100
    namespace: header
  responseClassifier:
    kind: io.l5d.retryableRead5XX
  client:
    loadBalancer:
      kind: ewma
    failureAccrual:
      kind: io.l5d.successRate
      successRate: 0.9
      requests: 20
      backoff:
        kind: constant
        ms: 10000
  servers:
  - port: 4140
    ip: 0.0.0.0

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 13 (13 by maintainers)

Top GitHub Comments

1 reaction
adleong commented, Apr 26, 2017

After some investigation, this appears to be related to retries when the responses are chunk encoded. My current theory is that when a chunk encoded response is retried, linkerd never actually reads the chunks of the discarded response and since the response has unread chunks, the connection is never released.
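
If that theory holds, the failing backend should be returning its 5xx responses with chunked transfer encoding, which is easy to spot-check against the endpoint from the netstat output above (the request path is illustrative):

# inspect the status line and transfer encoding of the erroring backend
curl -sv -o /dev/null http://10.122.3.77:31392/ 2>&1 | grep -Ei '^< (HTTP|Transfer-Encoding)'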

I will attempt to test this theory and put together a fix. I’ll update this issue as I know more.

0 reactions
adleong commented, May 2, 2017

I’ve closed this as I believe that https://github.com/linkerd/linkerd/issues/1256 fixes the issue. Please reopen this if you see this behavior again.
