Leaking connections on 500 retries
Posting this as a follow-up to a Slack conversation with Alex. This is against linkerd 0.9.0.
We are noticing that linkerd's connection pool for certain endpoints grows rapidly during error conditions and then never shrinks, whereas when the same app gets a burst of normal traffic, the pool drains back to 0 fairly quickly.
The timeline is:
- Request comes in. We send that request to linkerd.
- Linkerd returns an error to us after 20 seconds (our request timeout)
- Connection pool has ~83 idle connections that do not disappear
- metrics.json showed 83 responses with a 500 status and 83 ResponseClassificationSyntheticExceptions
- `netstat -atpn | grep <L5d.pid> | grep CLOSE_WAIT` shows 83 sockets in CLOSE_WAIT (see the snippet after this list)
- It happened on Friday, and the connections are still there the following Monday
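A minimal way to line those two numbers up, assuming the standard Finagle per-client connections gauge and the admin port 9990 from the config below (stat names can vary by version):

```sh
# What linkerd thinks it has open: per-client "connections" gauges in
# metrics.json, served by the admin interface (port 9990 in the config below).
curl -s http://localhost:9990/admin/metrics.json | tr ',' '\n' | grep '/connections'

# What the kernel sees for the same process: sockets stuck in CLOSE_WAIT.
netstat -atpn 2>/dev/null | grep "<L5d.pid>" | grep -c CLOSE_WAIT
```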
The service recovered and is still working now. The connection pool count in the metrics is accurate: linkerd really is holding onto CLOSE_WAIT sockets, and this will eventually become an issue requiring a restart.
This happened twice, so the numbers in the attached metrics.json are a bit different. It also occurred on 3 separate linkerd instances at about the same time: the cause was the way the downstream app responded, but it affected 3 independent linkerds. Alex identified this in Slack:
I think you got stuck in a retry loop.
"rt/cgp/dst/path/svc/live/appName/retries/total": 274,
Something about that one request triggered a 5xx response from the server, and it kept being retried and failing until either the retry budget was exhausted or the request timed out. It looks like there were 99 requests, 97 of which were successful and 2 of which were failures, and those failures triggered 274 retries.
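Those counters come straight out of the admin endpoint; a quick way to pull everything retry-related (again assuming admin port 9990 from the config below) is:

```sh
# Dump every retry counter for the router; the rt/cgp/.../retries/total stat
# quoted above appears here alongside any per-client requeue counters.
curl -s http://localhost:9990/admin/metrics.json | tr ',' '\n' | grep 'retries'
```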
We have retryableRead5XX turned on, which is likely the reason for the large number of connections. This is a server in our staging environment, so it gets very little traffic. In the case of the error, a single request came in that caused 83 abandoned connections.
The attached metrics.json has the raw numbers; compare the connection count in the metrics with the CLOSE_WAIT sockets:
root@ip-10-122-17-136:~# netstat -atpn | grep 3887 |grep CLOSE_WAIT |wc
166 1162 16102
All of these CLOSE_WAIT sockets point at the server that errored out:
root@ip-10-122-17-136:~# netstat -atpn | grep 3887 |grep CLOSE_WAIT | grep 10.122.3.77:31392|wc
166 1162 16102
And this server is still serving traffic on that ip:port. We just had some sort of disruption for this brief window.
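A quick way to double-check that is to group the CLOSE_WAIT sockets by remote endpoint (pid 3887 as in the session above):

```sh
# Column 5 of `netstat -atpn` is the remote address; counting by it shows
# every leaked socket points at 10.122.3.77:31392.
netstat -atpn 2>/dev/null | grep 3887 | grep CLOSE_WAIT | \
  awk '{print $5}' | sort | uniq -c | sort -rn
```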
I have the metrics captured by Datadog, so we can see how they changed in 30-second windows. The attached graphs show the connection count, the count of 500 errors, and the bytes_sent stat for the application.
Debugging attempts
We attempted to see if using the connections would result in them being cleaned up. I flooded a server that had 159 connections and 159 sockets in CLOSE_WAIT. The pool grew to 360+, then returned to 159, so exercising the pool did not release the leaked sockets.
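Roughly what that flood looked like, as a sketch (the X-Route-Id value and the /health path below are placeholders, not the real route):

```sh
# Push a burst of requests through linkerd's server port (4140 per the config)
# so the client pool has to open, and then idle out, fresh connections...
for i in $(seq 1 500); do
  curl -s -o /dev/null -H 'X-Route-Id: appName' http://localhost:4140/health &
done
wait

# ...then re-count the leaked sockets; in our test this settled back at 159.
netstat -atn | grep '10.122.3.77:31392' | grep -c CLOSE_WAIT
```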
Dead sockets
This is a sample of netstat showing the dead sockets that match the leaked connection counts
tcp 1 0 10.122.8.59:56734 10.122.3.77:31392 CLOSE_WAIT 4228/java
tcp 1 0 10.122.8.59:35906 10.122.3.77:31392 CLOSE_WAIT 4228/java
Linkerd log
. Remote Address: Inet(/10.122.3.77:31392,Map())
2017/03/31 18:17:16 I 0331 18:17:16.862 UTC THREAD24 TraceId:4f2d95dc50b33383: FailureAccrualFactory marking connection to "#/consul/us-east-1-vpc-XXXX/live/appName" as dead. Remote Address: Inet(/10.122.3.77:31392,Map())
2017/03/31 18:17:26 I 0331 18:17:26.970 UTC THREAD23 TraceId:4f2d95dc50b33383: FailureAccrualFactory marking connection to "#/consul/us-east-1-vpc-XXXX/live/appName" as dead. Remote Address: Inet(/10.122.3.77:31392,Map())
2017/03/31 18:17:30 E 0331 18:17:30.874 UTC THREAD10: service failure
2017/03/31 18:17:30 com.twitter.finagle.IndividualRequestTimeoutException: exceeded 20.seconds to 0.0.0.0/4140 while waiting for a response for an individual request, excluding retries. Remote Info: Not Available
2017/03/31 18:17:30
2017/03/31 18:17:31 E 0331 18:17:31.060 UTC THREAD24 TraceId:4f2d95dc50b33383: service failure
2017/03/31 18:17:31 Failure(20.seconds, flags=0x03) with RemoteInfo -> Upstream Address: /127.0.0.1:54910, Upstream Client Id: Not Available, Downstream Address: /10.122.3.77:31392, Downstream Client Id: #/consul/us-east-1-vpc-XXXX/live/appName, Trace Id: 4f2d95dc50b33383.4f2d95dc50b33383<:4f2d95dc50b33383
2017/03/31 18:17:31 Caused by: com.twitter.util.TimeoutException: 20.seconds
2017/03/31 18:17:31 at com.twitter.util.Future$$anonfun$within$1.apply(Future.scala:992)
Config file
```yaml
admin:
  port: 9990

routers:
- protocol: http
  httpAccessLog: /opt/proxy/logs/access.log
  label: cgp
  timeoutMs: 2000
  identifier:
  - kind: io.l5d.header
    header: X-Route-Id
  interpreter:
    kind: io.l5d.namerd
    dst: /$/inet/namerd/4100
    namespace: header
  responseClassifier:
    kind: io.l5d.retryableRead5XX
  client:
    loadBalancer:
      kind: ewma
    failureAccrual:
      kind: io.l5d.successRate
      successRate: 0.9
      requests: 20
      backoff:
        kind: constant
        ms: 10000
  servers:
  - port: 4140
    ip: 0.0.0.0
```
Top GitHub Comments
After some investigation, this appears to be related to retries when the responses are chunk-encoded. My current theory is that when a chunk-encoded response is retried, linkerd never actually reads the chunks of the discarded response, and since the response has unread chunks, the connection is never released.
I will attempt to test this theory and put together a fix. I’ll update this issue as I know more.
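One way to exercise that code path, assuming a test route whose backend answers with a chunked 5xx (the /fail path and header value here are hypothetical):

```sh
# A single request classified as retryable (5xx) and answered with a chunked
# body; if the theory holds, each discarded retry response leaves a socket behind.
curl -s -o /dev/null -H 'X-Route-Id: appName' http://localhost:4140/fail

# Then check whether CLOSE_WAIT sockets to the backend have piled up.
netstat -atn | grep '<backend-ip:port>' | grep -c CLOSE_WAIT
```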
I’ve closed this as I believe that https://github.com/linkerd/linkerd/issues/1256 fixes the issue. Please reopen this if you see this behavior again.