HTTP 504 Gateway timeouts after upgrading to linkerd 1.2.0
Filing a linkerd issue
Issue Type: Bug Report
What happened: We have a Rails application that serves web pages and runs as a Kubernetes deployment. Every time a deployment is made and new pods come up, Linkerd returns 504 Gateway Timeout. When I checked the Linkerd logs, I could see that it was still making requests to the old endpoints. (Not sure if this is a config issue.) It fixes itself after some time.
What you expected to happen: Endpoints should be instantly updated whenever there is a deployment.
How to reproduce it (as minimally and precisely as possible): Run a Rails (Puma) server that serves web pages behind Linkerd, then make requests to the service just after a deployment.
Anything else we need to know?: Our router config:
- protocol: http
  label: webapp-external
  identifier:
    kind: io.l5d.header.token
    header: Host
  interpreter:
    kind: io.l5d.namerd
    dst: /$/inet/namerd.linkerd.svc.cluster.local/4100
    namespace: external
    transformers:
    - kind: io.l5d.k8s.daemonset
      namespace: linkerd
      port: webapp-ingress
      service: linkerd-internal
  servers:
  - port: 4143
    ip: 0.0.0.0
  client:
    kind: io.l5d.global
    loadBalancer:
      kind: ewma
      enableProbation: false
      maxEffort: 5
      decayTimeMs: 10
    failureAccrual:
      kind: io.l5d.consecutiveFailures
      failures: 5
  service:
    kind: io.l5d.global
- protocol: http
  label: webapp-ingress
  identifier:
    kind: io.l5d.header.token
    header: Host
  interpreter:
    kind: io.l5d.namerd
    dst: /$/inet/namerd.linkerd.svc.cluster.local/4100
    namespace: external
    transformers:
    - kind: io.l5d.k8s.localnode
  servers:
  - port: 4145
    ip: 0.0.0.0
  client:
    kind: io.l5d.global
    loadBalancer:
      kind: ewma
      enableProbation: false
      maxEffort: 5
      decayTimeMs: 10
    failureAccrual:
      kind: io.l5d.consecutiveFailures
      failures: 5
  service:
    kind: io.l5d.global
Environment:
- linkerd/namerd version, config files: 1.2.0/1.2.0
- Platform, version, and config files (Kubernetes, DC/OS, etc): Kubernetes
- Cloud provider or hardware configuration: AWS
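To make the reproduction step measurable, one option is to poll the service through Linkerd while a deployment rolls out and count the status codes that come back. The following is a minimal sketch, not something taken from the report: the function names, the polling interval, and the example host are all assumptions. Only the port (4143, the `webapp-external` server above) and the use of the `Host` header for routing come from the config.

```python
import time
import urllib.error
import urllib.request
from collections import Counter


def fetch_status(url, host_header, timeout=5.0):
    """Issue one GET and return the HTTP status code, or 'error' if no response arrived."""
    # The routers above identify requests by the Host header, so set it explicitly.
    req = urllib.request.Request(url, headers={"Host": host_header})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # Non-2xx responses, including 504, are raised as HTTPError.
        return exc.code
    except OSError:
        # Connection refused, DNS failure, socket timeout, and similar.
        return "error"


def tally_statuses(fetch, attempts, interval_s=0.5, sleep=time.sleep):
    """Call fetch() `attempts` times and return a Counter of the results."""
    counts = Counter()
    for _ in range(attempts):
        counts[fetch()] += 1
        sleep(interval_s)
    return counts


# Hypothetical usage (address and host name are placeholders, not from the report):
# counts = tally_statuses(
#     lambda: fetch_status("http://127.0.0.1:4143/", "webapp.example.com"),
#     attempts=120,
# )
# print(counts)
```

Starting the loop just before the rollout and comparing the number of 504 responses against 200 responses gives a rough measure of how long Linkerd keeps routing to the old endpoints.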
Issue Analytics
- State:
- Created 6 years ago
- Comments:50 (25 by maintainers)
Top GitHub Comments
@bseibel I’ve been looking into this some more today and I agree that this issue is almost certainly related to kubernetes/kubernetes#35068. That also explains why our unit tests haven’t caught this issue, as the tests for handling the “too old resource version” response set the response status code to 410.
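The gap described in that comment can be illustrated with a small check that treats a watch as expired in two cases: the HTTP response status is 410, or the stream itself carries an error event with code 410. This is a hedged sketch only. It assumes that the linked Kubernetes issue is about the API server reporting "too old resource version" as an `ERROR` event inside an otherwise successful watch response. It is not linkerd's actual code, and `watch_expired` is a hypothetical name.

```python
import json

GONE = 410  # HTTP "Gone", used for an expired resourceVersion


def watch_expired(http_status, line=None):
    """Return True if the watch should be restarted from a fresh list.

    http_status: status code of the watch response.
    line: one raw JSON line from the watch stream, if any.
    """
    # Case 1: the expiry is signalled by the response status itself.
    if http_status == GONE:
        return True
    if line is None:
        return False
    # Case 2 (assumed): the response status is fine, but the stream
    # contains an ERROR event whose embedded Status object has code 410.
    try:
        event = json.loads(line)
    except ValueError:
        return False
    if not isinstance(event, dict) or event.get("type") != "ERROR":
        return False
    obj = event.get("object")
    return isinstance(obj, dict) and obj.get("code") == GONE
```

A test suite that only exercises the first case would pass even if the second case were never handled, which matches the comment's explanation of why the unit tests did not catch the problem.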
Unfortunately, we're still seeing this issue even with the fix here. We now see in the debug logs
however, where we do see lines like (and pardon my slightly filtered log lines without the endpoints):
for most pre-existing endpoints, which is fine and expected. But a service that was added after the "restarting watch" line doesn't appear in the logs at all, and linkerd ends up giving us "No hosts are available".
So far we've only seen this happen in production, but it happens pretty frequently, sometimes minutes after we kick our namerd pods. Linkerd isn't logging any issues about connectivity to namerd. I'm out of town at the moment and will try to narrow down the issue further when I'm back later this week, but if there's anything specific you would like me to poke at to help narrow it down, please let me know.