kube-state-metrics timing out with agent v6.1.0-6.1.4 / v5.23.0
We’ve just switched from agent v5 using our own YAML to v6.1.0 using the Helm chart, and we appear to have hit the same problem as https://github.com/kubernetes/charts/issues/1466. We deploy kube-state-metrics separately rather than bundling it as a Helm dependency; for that we’re on 1.2.0.
After the upgrade most things are working OK, but the kube-state-metrics metrics are no longer coming through, even though nothing else in the system has changed.
This is what we see in the logs:
```
datadog-master-rcd8v datadog [ AGENT ] 2018-04-05 04:33:04 UTC | ERROR | (runner.go:276 in work) | Error running check kubernetes_state: [{"message": "HTTPConnectionPool(host='10.2.13.182', port=8080): Read timed out. (read timeout=1)", "traceback": "Traceback (most recent call last):\n File \"/opt/datadog-agent/bin/agent/dist/checks/__init__.py\", line 332, in run\n self.check(copy.deepcopy(self.instances[0]))\n File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/kubernetes_state/kubernetes_state.py\", line 196, in check\n self.process(endpoint, send_histograms_buckets=send_buckets, instance=instance)\n File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/checks/prometheus/mixins.py\", line 350, in process\n for metric in self.scrape_metrics(endpoint):\n File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/checks/prometheus/mixins.py\", line 314, in scrape_metrics\n response = self.poll(endpoint)\n File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/checks/prometheus/mixins.py\", line 467, in poll\n response = requests.get(endpoint, headers=headers, stream=True, timeout=1, cert=cert, verify=verify)\n File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/api.py\", line 72, in get\n return request('get', url, params=params, **kwargs)\n File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/api.py\", line 58, in request\n return session.request(method=method, url=url, **kwargs)\n File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/sessions.py\", line 508, in request\n resp = self.send(prep, **send_kwargs)\n File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/sessions.py\", line 618, in send\n r = adapter.send(request, **kwargs)\n File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/adapters.py\", line 521, in send\n raise ReadTimeout(e, request=request)\nReadTimeout: HTTPConnectionPool(host='10.2.13.182', port=8080): Read timed out. (read timeout=1)\n"}]
```
I’ve confirmed 10.2.13.182 is the correct pod IP address; as mentioned, none of that side of things has changed, so Autodiscovery still seems to be working.
I went to the datadog pod in question and did a little investigation on the timing:
```
time curl -v --silent --output /dev/null --show-error --fail 10.2.13.182:8080/metrics
```
The curl responds fine with content, but I’ve never seen it complete in under 1 second. However, it appears the 1-second timeout is hard-coded: https://github.com/DataDog/integrations-core/blob/69e3a575ff5d9dc62703b9b9f7789d98e91a2ec5/datadog-checks-base/datadog_checks/checks/prometheus/mixins.py#L472
The payload is ~11214 bytes on that cluster. I tried increasing the resources for the kube-state-metrics pod but wasn’t able to get it to respond any faster; the stats for the kube-state-metrics container show it isn’t even using close to its allocated CPU/memory.
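To quantify this from inside the agent pod, here is a minimal sketch, assuming plain `requests` is available, that mirrors the GET call shown in the traceback above (same endpoint, `stream=True`): first with the hard-coded 1 s timeout and then with a longer one, printing how long the scrape actually takes and how large the payload is. The endpoint is just the pod IP from the logs; the `scrape` helper name is mine, not anything from the agent.

```python
import time
import requests

# Pod IP taken from the agent logs above; adjust for your own cluster.
ENDPOINT = "http://10.2.13.182:8080/metrics"

def scrape(timeout):
    """GET the kube-state-metrics endpoint the same way the check's poll() does."""
    start = time.time()
    try:
        resp = requests.get(ENDPOINT, stream=True, timeout=timeout)
        body = resp.content  # read the full payload
        print("timeout=%ss -> %d bytes in %.2fs" % (timeout, len(body), time.time() - start))
    except requests.exceptions.Timeout:
        print("timeout=%ss -> timed out after %.2fs" % (timeout, time.time() - start))

scrape(1)   # the value the check currently uses; times out on this cluster
scrape(10)  # completes, showing the real scrape latency and payload size
```

Run from the agent pod, the 1 s call should reproduce the same read timeout as the check, while the 10 s call gives a latency number to compare against the curl timings above.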
I’m not sure how best to deal with this. As a start, the timeout could be made configurable, but perhaps there are broader plans to handle this in a better way.
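For illustration, making the timeout configurable in the Prometheus mixin’s poll() could look roughly like the sketch below. This is not the actual integrations-core signature, and the `prometheus_timeout` option name is an assumption rather than an existing setting; it is only meant to show the shape of the change.

```python
import requests

class PrometheusScraperSketch(object):
    # Illustrative sketch only: how the mixin's poll() could honour a
    # per-instance timeout instead of the hard-coded 1 second. The
    # "prometheus_timeout" option name and this simplified signature are
    # assumptions, not the real integrations-core API.
    def poll(self, endpoint, headers=None, cert=None, verify=None, instance=None):
        instance = instance or {}
        timeout = float(instance.get("prometheus_timeout", 1))  # keep 1 s as the default
        return requests.get(endpoint, headers=headers, stream=True,
                            timeout=timeout, cert=cert, verify=verify)
```

A per-instance option like this could then be raised in the kubernetes_state check configuration (or the equivalent Autodiscovery annotation) on clusters where the endpoint is slow, without patching the check itself.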
Issue Analytics
- Created 5 years ago
- Reactions: 1
- Comments: 55 (24 by maintainers)
Top GitHub Comments
We had some 5.23.0 issues too on different nodes; reverting to 5.22.3 (docker tag 12.6.5223) resolved that.
I was having the same issue running with 12.6.5230; downgrading worked for me, thanks @c-knowles. Is there a plan to roll out the fix on 5.x?