Handle timeout when validating envoy config
Describe the bug
When the envoy config is updated, it is validated by running envoy --config-path <path> --mode validate. This currently has a fixed timeout of 5 seconds. However, when an actual timeout happens, it is not handled properly and seems to crash ambassador's automatic reconfiguration.
This ultimately leads to a state where ambassador still runs and handles requests properly, but is not able to receive new route mappings anymore. In this case only a restart of ambassador seems to help.
To Reproduce
I'm not sure how one would reproduce such a timeout. In our case it happens from time to time and results in the behavior described above.
Looking into the code at https://github.com/datawire/ambassador/blob/9484c0b07465e1547cc57a8942d3374d4ef4cf66/ambassador/ambassador_diag/diagd.py#L873-L878, it seems that only subprocess.CalledProcessError is handled. According to the Python documentation, a timeout results in a subprocess.TimeoutExpired error.
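The mismatch can be reproduced outside ambassador. The following is a minimal sketch, not the diagd code: SLOW_COMMAND is a hypothetical stand-in for the envoy validation command (a child Python process that sleeps past the timeout), and run_like_diagd and classify are illustrative names. It shows that a handler which only catches subprocess.CalledProcessError lets subprocess.TimeoutExpired propagate to the caller.

```python
import subprocess
import sys

# Hypothetical stand-in for the envoy validation command: a child process
# that sleeps longer than the timeout we pass below.
SLOW_COMMAND = [sys.executable, "-c", "import time; time.sleep(5)"]


def run_like_diagd(command, timeout):
    """Mimic a handler that only catches CalledProcessError."""
    try:
        subprocess.check_output(command, stderr=subprocess.STDOUT, timeout=timeout)
        return "ok"
    except subprocess.CalledProcessError:
        # Reached only when the command exits with a non-zero status.
        return "handled"


def classify(command, timeout):
    """Report whether the timeout escaped the inner handler."""
    try:
        return run_like_diagd(command, timeout)
    except subprocess.TimeoutExpired:
        # TimeoutExpired is not a subclass of CalledProcessError,
        # so it was not caught inside run_like_diagd.
        return "escaped"


result = classify(SLOW_COMMAND, timeout=0.5)
print(result)
```

Both exception types derive from subprocess.SubprocessError, so catching that base class (or listing both types explicitly) would cover the timeout case as well.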
Log Output:
2019-04-30 15:01:07 diagd 0.60.1 [P54TAmbassadorEventWatcher] ERROR: could not reconfigure: Command '['envoy', '--config-path', '/ambassador/snapshots/econf-tmp.json', '--mode', 'validate']' timed out after 5 seconds
2019-04-30 15:01:07 diagd 0.60.1 [P54TAmbassadorEventWatcher] ERROR: Command '['envoy', '--config-path', '/ambassador/snapshots/econf-tmp.json', '--mode', 'validate']' timed out after 5 seconds
Traceback (most recent call last):
  File "/usr/lib/python3.6/subprocess.py", line 425, in run
    stdout, stderr = process.communicate(input, timeout=timeout)
  File "/usr/lib/python3.6/subprocess.py", line 863, in communicate
    stdout, stderr = self._communicate(input, endtime, timeout)
  File "/usr/lib/python3.6/subprocess.py", line 1535, in _communicate
    self._check_timeout(endtime, orig_timeout)
  File "/usr/lib/python3.6/subprocess.py", line 891, in _check_timeout
    raise TimeoutExpired(self.args, orig_timeout)
subprocess.TimeoutExpired: Command '['envoy', '--config-path', '/ambassador/snapshots/econf-tmp.json', '--mode', 'validate']' timed out after 5 seconds

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ambassador-0.0.0.dev0-py3.6.egg/ambassador_diag/diagd.py", line 590, in run
    self.load_config_watt(rqueue, url)
  File "/usr/lib/python3.6/site-packages/ambassador-0.0.0.dev0-py3.6.egg/ambassador_diag/diagd.py", line 708, in load_config_watt
    self._load_ir(rqueue, aconf, fetcher, scc, snapshot)
  File "/usr/lib/python3.6/site-packages/ambassador-0.0.0.dev0-py3.6.egg/ambassador_diag/diagd.py", line 727, in _load_ir
    if not self.validate_envoy_config(config=ads_config):
  File "/usr/lib/python3.6/site-packages/ambassador-0.0.0.dev0-py3.6.egg/ambassador_diag/diagd.py", line 874, in validate_envoy_config
    odict['output'] = subprocess.check_output(command, stderr=subprocess.STDOUT, timeout=5)
  File "/usr/lib/python3.6/subprocess.py", line 356, in check_output
    **kwargs).stdout
  File "/usr/lib/python3.6/subprocess.py", line 430, in run
    stderr=stderr)
subprocess.TimeoutExpired: Command '['envoy', '--config-path', '/ambassador/snapshots/econf-tmp.json', '--mode', 'validate']' timed out after 5 seconds
Expected behavior
If a timeout happens, it should be handled properly and not crash parts of ambassador. Ideally, we would also get the output of the subprocess on a TimeoutExpired error to help debug potential causes of the timeout.
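One possible shape for such handling is sketched below. This is an assumption about how a fix could look, not the change that was merged: validate_config, the exit_code and timed_out keys, and the test commands are hypothetical, and only the odict['output'] key and the check_output call mirror the traceback above. subprocess.TimeoutExpired carries an output attribute, which may be None or only partial output depending on what the child wrote before it was killed.

```python
import subprocess
import sys


def validate_config(command, timeout=5):
    """Run a validation command; never raise on failure or timeout.

    Hypothetical helper: returns a dict with the exit code, whether the
    command timed out, and whatever output could be captured.
    """
    odict = {"exit_code": 0, "timed_out": False, "output": ""}
    try:
        raw = subprocess.check_output(
            command, stderr=subprocess.STDOUT, timeout=timeout
        )
    except subprocess.CalledProcessError as e:
        # Non-zero exit status: keep the real code and the captured output.
        odict["exit_code"] = e.returncode
        raw = e.output
    except subprocess.TimeoutExpired as e:
        # Timeout: treat as a failed validation and keep any partial output.
        odict["exit_code"] = 1
        odict["timed_out"] = True
        raw = e.output  # may be None if nothing was captured
    odict["output"] = (raw or b"").decode("utf-8", errors="replace")
    return odict


# Stand-in commands (assumptions for illustration, not envoy itself).
slow = validate_config([sys.executable, "-c", "import time; time.sleep(5)"], timeout=1)
failed = validate_config([sys.executable, "-c", "import sys; sys.exit(3)"])
passed = validate_config([sys.executable, "-c", "print('ok')"])
print(slow["timed_out"], failed["exit_code"], passed["output"].strip())
```

A caller could then log odict["output"] and skip the reconfiguration for that snapshot rather than letting the exception end the watcher thread.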
Versions:
- Ambassador: 0.60.1
- Kubernetes environment: Google Kubernetes Engine
- Kubernetes Version: v1.11.7-gke.12
Top GitHub Comments
Yup, the corresponding PR fixed this issue for us.
I sent a PR for the logging problem: https://github.com/datawire/ambassador/pull/2766