Linkerd times out requests following k8s pod deletion
Summary
Using a modified Linkerd1 lifecycle environment (commit), we observed slow-cooker requests timing out, with the downstream linkerd reporting `CancelledRequestException`, followed by `Failed mid-stream. Terminating stream, closing connection`, followed by `ChannelClosedException`.
This test environment deletes a `bb-terminus` pod every 30 seconds, while the upstream `slow-cooker` and `bb-p2p-broadcast` pods attempt to make requests continuously.
Steps to reproduce
Deploy
cat linkerd.yml | kubectl apply -f - && bin/deploy 1
Observe
Around 2018-09-26T17:29:37, slow-cooker requests start timing out (full log):
2018-09-26T17:29:27Z 10/0/0 10 100% 10s 18 [ 20 21 21 21 ] 21 0
2018-09-26T17:29:37Z 0/1/0 10 10% 10s 37 [ 37 37 37 37 ] 37 0
Get http://bb-p2p.lifecycle1: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2018-09-26T17:29:47Z 0/0/1 10 0% 10s 0 [ 0 0 0 0 ] 0 0 -
Get http://bb-p2p.lifecycle1: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2018-09-26T17:29:57Z 0/0/1 10 0% 10s 0 [ 0 0 0 0 ] 0 0 -
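For context, `net/http: request canceled (Client.Timeout exceeded while awaiting headers)` is Go's `http.Client` giving up because response headers did not arrive before the client deadline, i.e. linkerd accepted the connection but never answered. A minimal sketch of that client-side behavior (not slow-cooker's actual code; the 10s timeout matches the interval above but is an assumption):

```go
// Sketch of the slow-cooker-side failure mode: an http.Client with a
// deadline cancels the request while still waiting for response headers.
// Hypothetical illustration only; timeout and URL are assumed.
package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	client := &http.Client{Timeout: 10 * time.Second}
	for {
		resp, err := client.Get("http://bb-p2p.lifecycle1")
		if err != nil {
			// If the proxy holds the request past the deadline, this logs:
			//   Get http://bb-p2p.lifecycle1: net/http: request canceled
			//   (Client.Timeout exceeded while awaiting headers)
			log.Println(err)
			continue
		}
		resp.Body.Close()
	}
}
```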
linkerd log
D 0926 17:29:29.275 UTC THREAD25: [S L:/10.233.66.68:4340 R:/10.233.66.77:39628 S:2501] stream closed
D 0926 17:29:39.267 UTC THREAD30 TraceId:92a26f0c24a89eff: Exception propagated to the default monitor (upstream address: /10.233.66.72:59644, downstream address: /10.233.66.68:4141, label: %/io.l5d.k8s.daemonset/linkerd/http-incoming/l5d/#/io.l5d.k8s/lifecycle1/http/bb-p2p).
com.twitter.finagle.CancelledRequestException: request cancelled. Remote Info: Upstream Address: Not Available, Upstream id: Not Available, Downstream Address: /10.233.66.77:7070, Downstream label: %/io.l5d.k8s.localnode/10.233.66.68/#/io.l5d.k8s/lifecycle1/http/bb-p2p, Trace Id: 92a26f0c24a89eff.d7d924a2654069fb<:d844b6d81c2331f9
D 0926 17:29:39.268 UTC THREAD26 TraceId:92a26f0c24a89eff: Exception propagated to the default monitor (upstream address: /10.233.66.68:47320, downstream address: /10.233.66.77:7070, label: %/io.l5d.k8s.localnode/10.233.66.68/#/io.l5d.k8s/lifecycle1/http/bb-p2p).
com.twitter.finagle.CancelledRequestException: request cancelled. Remote Info: Upstream Address: Not Available, Upstream id: Not Available, Downstream Address: /10.233.66.77:7070, Downstream label: %/io.l5d.k8s.localnode/10.233.66.68/#/io.l5d.k8s/lifecycle1/http/bb-p2p, Trace Id: 92a26f0c24a89eff.d7d924a2654069fb<:d844b6d81c2331f9
D 0926 17:29:39.271 UTC THREAD25 TraceId:92a26f0c24a89eff: Failed mid-stream. Terminating stream, closing connection
com.twitter.finagle.ChannelClosedException: ChannelException at remote address: /10.233.66.68:4141. Remote Info: Not Available
at com.twitter.finagle.netty4.transport.ChannelTransport$$anon$2.channelInactive(ChannelTransport.scala:196)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at io.netty.handler.codec.MessageAggregator.channelInactive(MessageAggregator.java:417)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelInactive(CombinedChannelDuplexHandler.java:420)
at io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:377)
at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:342)
at io.netty.handler.codec.http.HttpClientCodec$Decoder.channelInactive(HttpClientCodec.java:281)
at io.netty.channel.CombinedChannelDuplexHandler.channelInactive(CombinedChannelDuplexHandler.java:223)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at com.twitter.finagle.netty4.channel.ChannelRequestStatsHandler.channelInactive(ChannelRequestStatsHandler.scala:41)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at com.twitter.finagle.netty4.channel.ChannelStatsHandler.channelInactive(ChannelStatsHandler.scala:148)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1354)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:917)
at io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:822)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at com.twitter.finagle.util.BlockingTimeTrackingThreadFactory$$anon$1.run(BlockingTimeTrackingThreadFactory.scala:23)
at java.lang.Thread.run(Thread.java:748)
D 0926 17:29:39.271 UTC THREAD27 TraceId:92a26f0c24a89eff: Failed mid-stream. Terminating stream, closing connection
com.twitter.finagle.ChannelClosedException: ChannelException at remote address: /10.233.66.77:7070. Remote Info: Not Available
at com.twitter.finagle.netty4.transport.ChannelTransport$$anon$2.channelInactive(ChannelTransport.scala:196)
...
E 0926 17:29:39.275 UTC THREAD26 TraceId:92a26f0c24a89eff: service failure: com.twitter.finagle.CancelledRequestException: request cancelled. Remote Info: Upstream Address: Not Available, Upstream id: Not Available, Downstream Address: /10.233.66.77:7070, Downstream label: %/io.l5d.k8s.localnode/10.233.66.68/#/io.l5d.k8s/lifecycle1/http/bb-p2p, Trace Id: 92a26f0c24a89eff.d7d924a2654069fb<:d844b6d81c2331f9
E 0926 17:29:39.275 UTC THREAD30 TraceId:92a26f0c24a89eff: service failure: com.twitter.finagle.CancelledRequestException: request cancelled. Remote Info: Upstream Address: Not Available, Upstream id: Not Available, Downstream Address: /10.233.66.77:7070, Downstream label: %/io.l5d.k8s.localnode/10.233.66.68/#/io.l5d.k8s/lifecycle1/http/bb-p2p, Trace Id: 92a26f0c24a89eff.d7d924a2654069fb<:d844b6d81c2331f9
D 0926 17:29:39.287 UTC THREAD25: [S L:/10.233.66.68:4340 R:/10.233.66.77:39628 S:2503] initialized stream
D 0926 17:29:39.290 UTC THREAD25 TraceId:69fac7e51063f2d1: k8s lookup: /lifecycle1/grpc/bb-terminus1 /lifecycle1/grpc/bb-terminus1
D 0926 17:29:39.294 UTC THREAD25: [S L:/10.233.66.68:4340 R:/10.233.66.77:39628 S:2503] stream closed
D 0926 17:29:41.621 UTC THREAD29 TraceId:c8f97d6b63697a48: k8s ns lifecycle1 service bb-terminus1 modified endpoints
D 0926 17:29:41.630 UTC THREAD28 TraceId:630292ce49f848be: k8s ns lifecycle1 service bb-terminus2 modified endpoints
D 0926 17:29:49.263 UTC THREAD29 TraceId:f97a23c7f3f8ee43: Exception propagated to the default monitor (upstream address: /10.233.66.72:38982, downstream address: /10.233.66.68:4141, label: %/io.l5d.k8s.daemonset/linkerd/http-incoming/l5d/#/io.l5d.k8s/lifecycle1/http/bb-p2p).
com.twitter.finagle.CancelledRequestException: request cancelled. Remote Info: Upstream Address: Not Available, Upstream id: Not Available, Downstream Address: /10.233.66.77:7070, Downstream label: %/io.l5d.k8s.localnode/10.233.66.68/#/io.l5d.k8s/lifecycle1/http/bb-p2p, Trace Id: f97a23c7f3f8ee43.898e458de51e998c<:ec154c3d40d22178
D 0926 17:29:49.263 UTC THREAD30 TraceId:f97a23c7f3f8ee43: Failed mid-stream. Terminating stream, closing connection
com.twitter.finagle.ChannelClosedException: ChannelException at remote address: /10.233.66.68:4141. Remote Info: Not Available
at com.twitter.finagle.netty4.transport.ChannelTransport$$anon$2.channelInactive(ChannelTransport.scala:196)
linkerd metrics.json
https://gist.github.com/siggy/9b0b7784db5df929512631cffc010f85
linkerd client_state.json
https://gist.github.com/siggy/29ac1a5fd3c4639a42e05cc04f146042
linkerd namerd_state.json
https://gist.github.com/siggy/1180f911e2e16c929940b1dc1ef11e19
bb log
time="2018-09-26T17:29:28Z" level=info msg="Making request to [147.75.70.161:4340 / bb-terminus2.lifecycle1]"
time="2018-09-26T17:29:28Z" level=info msg="Making request to [147.75.70.161:4340 / bb-terminus1.lifecycle1]"
time="2018-09-26T17:29:28Z" level=error msg="Error when broadcasting request [requestUID:\"in:http-sid:broadcast-channel-grpc:-1-h1:7070-266590609\" ] to client [147.75.70.161:4340 / bb-terminus2.lifecycle1]: rpc error: code = Internal desc = transport: received the unexpected content-type \"text/plain\""
time="2018-09-26T17:29:28Z" level=error msg="Error when broadcasting request [requestUID:\"in:http-sid:broadcast-channel-grpc:-1-h1:7070-266590609\" ] to client [147.75.70.161:4340 / bb-terminus1.lifecycle1]: rpc error: code = Internal desc = transport: received the unexpected content-type \"text/plain\""
time="2018-09-26T17:29:28Z" level=info msg="Finished broadcast"
time="2018-09-26T17:29:28Z" level=error msg="Error while handling HTTP request: error handling http request: downstream server [147.75.70.161:4340 / bb-terminus2.lifecycle1] returned error: rpc error: code = Internal desc = transport: received the unexpected content-type \"text/plain\",downstream server [147.75.70.161:4340 / bb-terminus1.lifecycle1] returned error: rpc error: code = Internal desc = transport: received the unexpected content-type \"text/plain\""
time="2018-09-26T17:29:29Z" level=info msg="Received request with empty body, assigning new request UID [in:http-sid:broadcast-channel-grpc:-1-h1:7070-267164731] to it"
time="2018-09-26T17:29:29Z" level=debug msg="Received HTTP request [in:http-sid:broadcast-channel-grpc:-1-h1:7070-267164731] [&{Method:GET URL:/ Proto:HTTP/1.1 ProtoMajor:1 ProtoMinor:1 Header:map[User-Agent:[Go-http-client/1.1] Sc-Req-Id:[1251] Via:[1.1 linkerd, 1.1 linkerd] L5d-Dst-Client:[/%/io.l5d.k8s.localnode/10.233.66.68/#/io.l5d.k8s/lifecycle1/http/bb-p2p] Content-Length:[0] L5d-Dst-Service:[/svc/bb-p2p.lifecycle1] L5d-Ctx-Trace:[19kkomVAafvYRLbYHCMx+ZKibwwkqJ7/AAAAAAAAAAA=] L5d-Reqid:[92a26f0c24a89eff]] Body:{} GetBody:<nil> ContentLength:0 TransferEncoding:[] Close:false Host:bb-p2p.lifecycle1 Form:map[] PostForm:map[] MultipartForm:<nil> Trailer:map[] RemoteAddr:10.233.66.68:47720 RequestURI:/ TLS:<nil> Cancel:<nil> Response:<nil> ctx:0xc0005b9940}] Context [context.Background.WithValue(&http.contextKey{name:\"http-server\"}, &http.Server{Addr:\":7070\", Handler:(*protocols.httpHandler)(0xc00000e0a8), TLSConfig:(*tls.Config)(0xc0001d0000), ReadTimeout:0, ReadHeaderTimeout:0, WriteTimeout:0, IdleTimeout:0, MaxHeaderBytes:0, TLSNextProto:map[string]func(*http.Server, *tls.Conn, http.Handler){\"h2\":(func(*http.Server, *tls.Conn, http.Handler))(0x72a820)}, ConnState:(func(net.Conn, http.ConnState))(nil), ErrorLog:(*log.Logger)(nil), disableKeepAlives:0, inShutdown:0, nextProtoOnce:sync.Once{m:sync.Mutex{state:0, sema:0x0}, done:0x1}, nextProtoErr:error(nil), mu:sync.Mutex{state:0, sema:0x0}, listeners:map[*net.Listener]struct {}{(*net.Listener)(0xc0001a4080):struct {}{}}, activeConn:map[*http.conn]struct {}{(*http.conn)(0xc0002aa000):struct {}{}}, doneChan:(chan struct {})(nil), onShutdown:[]func(){(func())(0x7337f0)}}).WithValue(&http.contextKey{name:\"local-addr\"}, &net.TCPAddr{IP:net.IP{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0xa, 0xe9, 0x42, 0x4d}, Port:7070, Zone:\"\"}).WithCancel.WithCancel] Body [{RequestUID:in:http-sid:broadcast-channel-grpc:-1-h1:7070-267164731 XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}]"
time="2018-09-26T17:29:29Z" level=info msg="Starting broadcast to [2] downstream services"
time="2018-09-26T17:29:29Z" level=info msg="Making request to [147.75.70.161:4340 / bb-terminus2.lifecycle1]"
time="2018-09-26T17:29:29Z" level=info msg="Making request to [147.75.70.161:4340 / bb-terminus1.lifecycle1]"
time="2018-09-26T17:29:29Z" level=error msg="Error when broadcasting request [requestUID:\"in:http-sid:broadcast-channel-grpc:-1-h1:7070-267164731\" ] to client [147.75.70.161:4340 / bb-terminus1.lifecycle1]: rpc error: code = Internal desc = transport: received the unexpected content-type \"text/plain\""
time="2018-09-26T17:29:39Z" level=error msg="Error when broadcasting request [requestUID:\"in:http-sid:broadcast-channel-grpc:-1-h1:7070-267164731\" ] to client [147.75.70.161:4340 / bb-terminus2.lifecycle1]: rpc error: code = Canceled desc = context canceled"
time="2018-09-26T17:29:39Z" level=info msg="Finished broadcast"
time="2018-09-26T17:29:39Z" level=error msg="Error while handling HTTP request: error handling http request: downstream server [147.75.70.161:4340 / bb-terminus1.lifecycle1] returned error: rpc error: code = Internal desc = transport: received the unexpected content-type \"text/plain\",downstream server [147.75.70.161:4340 / bb-terminus2.lifecycle1] returned error: rpc error: code = Canceled desc = context canceled"
time="2018-09-26T17:29:39Z" level=info msg="Received request with empty body, assigning new request UID [in:http-sid:broadcast-channel-grpc:-1-h1:7070-286270779] to it"
time="2018-09-26T17:29:39Z" level=debug msg="Received HTTP request [in:http-sid:broadcast-channel-grpc:-1-h1:7070-286270779] [&{Method:GET URL:/ Proto:HTTP/1.1 ProtoMajor:1 ProtoMinor:1 Header:map[L5d-Dst-Client:[/%/io.l5d.k8s.localnode/10.233.66.68/#/io.l5d.k8s/lifecycle1/http/bb-p2p] Content-Length:[0] L5d-Reqid:[f97a23c7f3f8ee43] Via:[1.1 linkerd, 1.1 linkerd] Sc-Req-Id:[1252] L5d-Dst-Service:[/svc/bb-p2p.lifecycle1] L5d-Ctx-Trace:[iY5FjeUemYzsFUw9QNIhePl6I8fz+O5DAAAAAAAAAAA=] User-Agent:[Go-http-client/1.1]] Body:{} GetBody:<nil> ContentLength:0 TransferEncoding:[] Close:false Host:bb-p2p.lifecycle1 Form:map[] PostForm:map[] MultipartForm:<nil> Trailer:map[] RemoteAddr:10.233.66.68:55272 RequestURI:/ TLS:<nil> Cancel:<nil> Response:<nil> ctx:0xc0003dfb00}] Context [context.Background.WithValue(&http.contextKey{name:\"http-server\"}, &http.Server{Addr:\":7070\", Handler:(*protocols.httpHandler)(0xc00000e0a8), TLSConfig:(*tls.Config)(0xc0001d0000), ReadTimeout:0, ReadHeaderTimeout:0, WriteTimeout:0, IdleTimeout:0, MaxHeaderBytes:0, TLSNextProto:map[string]func(*http.Server, *tls.Conn, http.Handler){\"h2\":(func(*http.Server, *tls.Conn, http.Handler))(0x72a820)}, ConnState:(func(net.Conn, http.ConnState))(nil), ErrorLog:(*log.Logger)(nil), disableKeepAlives:0, inShutdown:0, nextProtoOnce:sync.Once{m:sync.Mutex{state:0, sema:0x0}, done:0x1}, nextProtoErr:error(nil), mu:sync.Mutex{state:0, sema:0x0}, listeners:map[*net.Listener]struct {}{(*net.Listener)(0xc0001a4080):struct {}{}}, activeConn:map[*http.conn]struct {}{(*http.conn)(0xc000712be0):struct {}{}}, doneChan:(chan struct {})(nil), onShutdown:[]func(){(func())(0x7337f0)}}).WithValue(&http.contextKey{name:\"local-addr\"}, &net.TCPAddr{IP:net.IP{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0xa, 0xe9, 0x42, 0x4d}, Port:7070, Zone:\"\"}).WithCancel.WithCancel] Body [{RequestUID:in:http-sid:broadcast-channel-grpc:-1-h1:7070-286270779 XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}]"
time="2018-09-26T17:29:39Z" level=info msg="Starting broadcast to [2] downstream services"
time="2018-09-26T17:29:39Z" level=info msg="Making request to [147.75.70.161:4340 / bb-terminus2.lifecycle1]"
time="2018-09-26T17:29:39Z" level=info msg="Making request to [147.75.70.161:4340 / bb-terminus1.lifecycle1]"
time="2018-09-26T17:29:39Z" level=error msg="Error when broadcasting request [requestUID:\"in:http-sid:broadcast-channel-grpc:-1-h1:7070-286270779\" ] to client [147.75.70.161:4340 / bb-terminus1.lifecycle1]: rpc error: code = Internal desc = transport: received the unexpected content-type \"text/plain\""
time="2018-09-26T17:29:49Z" level=error msg="Error when broadcasting request [requestUID:\"in:http-sid:broadcast-channel-grpc:-1-h1:7070-286270779\" ] to client [147.75.70.161:4340 / bb-terminus2.lifecycle1]: rpc error: code = Canceled desc = context canceled"
time="2018-09-26T17:29:49Z" level=info msg="Finished broadcast"
time="2018-09-26T17:29:49Z" level=error msg="Error while handling HTTP request: error handling http request: downstream server [147.75.70.161:4340 / bb-terminus1.lifecycle1] returned error: rpc error: code = Internal desc = transport: received the unexpected content-type \"text/plain\",downstream server [147.75.70.161:4340 / bb-terminus2.lifecycle1] returned error: rpc error: code = Canceled desc = context canceled"
redeployer (pod deletion) log
2018-09-26T17:27:56.642975698Z sleeping for 30 seconds...
2018-09-26T17:28:26.83007554Z found 1 running pods
2018-09-26T17:28:27.076606954Z pod "bb-terminus-78897bc464-z8h48" deleted
2018-09-26T17:28:27.083966334Z sleeping for 30 seconds...
2018-09-26T17:28:57.300890609Z found 1 running pods
2018-09-26T17:28:57.523259717Z pod "bb-terminus-78897bc464-687nt" deleted
2018-09-26T17:28:57.529930851Z sleeping for 30 seconds...
2018-09-26T17:29:27.860318854Z found 1 running pods
2018-09-26T17:29:28.113539307Z pod "bb-terminus-78897bc464-q4wxz" deleted
2018-09-26T17:29:28.12074857Z sleeping for 30 seconds...
2018-09-26T17:29:58.294564828Z found 1 running pods
2018-09-26T17:29:58.535772668Z pod "bb-terminus-78897bc464-vblj2" deleted
2018-09-26T17:29:58.543236254Z sleeping for 30 seconds...
Pod deletion command: https://github.com/linkerd/linkerd-examples/blob/ed076ee1cc378de5b3823d1efaeb86c08352d3b4/lifecycle/linkerd1/lifecycle.yml#L398
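The deletion loop linked above is a shell script; for illustration, a hypothetical client-go equivalent (namespace and label selector assumed) that produces the same log lines:

```go
// Redeployer sketch: delete one running bb-terminus pod every 30 seconds.
// Hypothetical Go equivalent of the shell loop linked above.
package main

import (
	"log"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	for {
		log.Println("sleeping for 30 seconds...")
		time.Sleep(30 * time.Second)
		pods, err := client.CoreV1().Pods("lifecycle1").List(metav1.ListOptions{
			LabelSelector: "app=bb-terminus", // assumed label
			FieldSelector: "status.phase=Running",
		})
		if err != nil || len(pods.Items) == 0 {
			continue
		}
		log.Printf("found %d running pods", len(pods.Items))
		name := pods.Items[0].Name
		if err := client.CoreV1().Pods("lifecycle1").Delete(name, &metav1.DeleteOptions{}); err != nil {
			log.Printf("error deleting pod %q: %v", name, err)
			continue
		}
		log.Printf("pod %q deleted", name)
	}
}
```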
Possibly related to #2079
Top GitHub Comments
@zoltrain Quick update: we are continuing to dig into this and have some promising leads. Stay tuned.
No problem.
@siggy I’ve done some more digging today. I figured “maybe this has been introduced recently”, so I’ve been retroactively testing the minor release changes, using the MaxConnectionAge setup I had on Friday. Good news: it fails consistently, and I found that something introduced between versions 1.4.3 and 1.4.4 may be causing this issue. I tested 1.4.4 and it hits the failures after around the 10-minute mark; I’m currently running 1.4.3 and it’s been going for 30 minutes without failure. I dug into the commit history between those releases and there was a lot of work done on H2 streams/connections: Finagle was upgraded, a stream buffer was replaced, and a status-code guard for resets was removed.
If I had to guess, I’d say one of these commits might be the cause:
https://github.com/linkerd/linkerd/commit/9876e3da13f1ab29365926e8455c334106397256
https://github.com/linkerd/linkerd/commit/ac64c5991df2d008c4e6855982273eca4e63f51c
https://github.com/linkerd/linkerd/commit/62aa66e5d0c6abec77f6289d5d9249a102928397
I’m no Scala expert, so I can’t really comment much on what’s going on in the above. I’ll leave that to the experts.
I’m going to run 1.4.3 continuously this afternoon to make sure it doesn’t eventually hit that same barrier. If not, we’ll look at downgrading the release to hopefully stop this while the source of it is tracked down.