Handle timeout when validating envoy config

Describe the bug

When the Envoy config is updated, it is validated by running envoy --config-path <path> --mode validate. This call currently has a fixed timeout of 5 seconds. However, when a timeout actually occurs, it is not handled properly and seems to crash Ambassador's automatic reconfiguration.

This ultimately leads to a state where Ambassador keeps running and handles requests properly but can no longer receive new route mappings. In this state, only a restart of Ambassador seems to help.

To Reproduce

I’m not sure how one would reproduce such a timeout. In our case it happens from time to time and results in the behavior described above.

Looking into the code at https://github.com/datawire/ambassador/blob/9484c0b07465e1547cc57a8942d3374d4ef4cf66/ambassador/ambassador_diag/diagd.py#L873-L878, it seems that only subprocess.CalledProcessError is handled. According to the Python documentation, a timeout raises a subprocess.TimeoutExpired exception instead.
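To illustrate why the timeout slips through, here is a minimal sketch. It does not reproduce the diagd.py code; `run_with_narrow_handler` and `SLOW_COMMAND` are hypothetical names, and a sleeping Python child process stands in for a slow envoy validation run. It shows that a handler written only for subprocess.CalledProcessError does not catch subprocess.TimeoutExpired, because the two are sibling exception classes.

```python
import subprocess
import sys

# Assumption: a child process that just sleeps stands in for a slow
# `envoy --mode validate` invocation.
SLOW_COMMAND = [sys.executable, "-c", "import time; time.sleep(5)"]


def run_with_narrow_handler(command, timeout):
    """Hypothetical helper: only anticipates a non-zero exit status."""
    try:
        subprocess.check_output(command, stderr=subprocess.STDOUT, timeout=timeout)
        return "ok"
    except subprocess.CalledProcessError:
        # A timeout never reaches this branch.
        return "called-process-error"


try:
    result = run_with_narrow_handler(SLOW_COMMAND, timeout=0.2)
except subprocess.TimeoutExpired:
    # The timeout propagates past the narrow handler to the caller.
    result = "timeout-escaped"

print(result)
```

Both exceptions derive from subprocess.SubprocessError, so catching that base class, or listing both classes explicitly, would cover the timeout case as well.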

Log Output:

2019-04-30 15:01:07 diagd 0.60.1 [P54TAmbassadorEventWatcher] ERROR: could not reconfigure: Command '['envoy', '--config-path', '/ambassador/snapshots/econf-tmp.json', '--mode', 'validate']' timed out after 5 seconds
2019-04-30 15:01:07 diagd 0.60.1 [P54TAmbassadorEventWatcher] ERROR: Command '['envoy', '--config-path', '/ambassador/snapshots/econf-tmp.json', '--mode', 'validate']' timed out after 5 seconds
Traceback (most recent call last):
  File "/usr/lib/python3.6/subprocess.py", line 425, in run
    stdout, stderr = process.communicate(input, timeout=timeout)
  File "/usr/lib/python3.6/subprocess.py", line 863, in communicate
    stdout, stderr = self._communicate(input, endtime, timeout)
  File "/usr/lib/python3.6/subprocess.py", line 1535, in _communicate
    self._check_timeout(endtime, orig_timeout)
  File "/usr/lib/python3.6/subprocess.py", line 891, in _check_timeout
    raise TimeoutExpired(self.args, orig_timeout)
subprocess.TimeoutExpired: Command '['envoy', '--config-path', '/ambassador/snapshots/econf-tmp.json', '--mode', 'validate']' timed out after 5 seconds

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ambassador-0.0.0.dev0-py3.6.egg/ambassador_diag/diagd.py", line 590, in run
    self.load_config_watt(rqueue, url)
  File "/usr/lib/python3.6/site-packages/ambassador-0.0.0.dev0-py3.6.egg/ambassador_diag/diagd.py", line 708, in load_config_watt
    self._load_ir(rqueue, aconf, fetcher, scc, snapshot)
  File "/usr/lib/python3.6/site-packages/ambassador-0.0.0.dev0-py3.6.egg/ambassador_diag/diagd.py", line 727, in _load_ir
    if not self.validate_envoy_config(config=ads_config):
  File "/usr/lib/python3.6/site-packages/ambassador-0.0.0.dev0-py3.6.egg/ambassador_diag/diagd.py", line 874, in validate_envoy_config
    odict['output'] = subprocess.check_output(command, stderr=subprocess.STDOUT, timeout=5)
  File "/usr/lib/python3.6/subprocess.py", line 356, in check_output
    **kwargs).stdout
  File "/usr/lib/python3.6/subprocess.py", line 430, in run
    stderr=stderr)
subprocess.TimeoutExpired: Command '['envoy', '--config-path', '/ambassador/snapshots/econf-tmp.json', '--mode', 'validate']' timed out after 5 seconds

Expected behavior

When a timeout happens, it should be handled properly and not crash parts of Ambassador. Ideally, the output of the subprocess would also be available on a TimeoutExpired error, to help debug potential causes of the timeout.
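One way the expected behavior could look is sketched below. This is an assumption-laden illustration, not the fix that was merged into Ambassador: `validate_config` and the keys of its result dict are hypothetical. The function handles both a failed validation and a timeout, and keeps whatever output the child produced so it can be logged. Note that the output attached to TimeoutExpired may be partial or missing, depending on the Python version and platform.

```python
import subprocess


def validate_config(command, timeout=5.0):
    """Hypothetical helper: run a validation command and report the outcome
    as a dict instead of letting subprocess exceptions propagate."""
    result = {"valid": False, "timed_out": False, "exit_code": None, "output": ""}
    try:
        out = subprocess.check_output(command, stderr=subprocess.STDOUT, timeout=timeout)
        result["valid"] = True
        result["exit_code"] = 0
    except subprocess.CalledProcessError as e:
        # Validation ran to completion but reported a failure.
        out = e.output
        result["exit_code"] = e.returncode
    except subprocess.TimeoutExpired as e:
        # Validation did not finish in time; e.output may be None or partial.
        out = e.output
        result["timed_out"] = True
    result["output"] = (out or b"").decode("utf-8", errors="replace")
    return result
```

A caller could then log `result["output"]` and skip the new configuration when `result["valid"]` is false, rather than letting the reconfiguration loop die on an unhandled exception.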

Versions:

  • Ambassador: 0.60.1
  • Kubernetes environment: Google Kubernetes Engine
  • Kubernetes Version: v1.11.7-gke.12

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 8 (5 by maintainers)

Top GitHub Comments

jwm commented, Aug 31, 2019

Yup, the corresponding PR fixed this issue for us.

jfrabaute commented, Jun 5, 2020

I sent a PR for the logging problem: https://github.com/datawire/ambassador/pull/2766
