
kube-state-metrics timing out with agent v6.1.0-6.1.4 / v5.23.0

See original GitHub issue

We’ve just switched from agent v5, deployed with our own YAML, to v6.1.0 deployed via the Helm chart, and we appear to have hit the same problem as https://github.com/kubernetes/charts/issues/1466. We deploy kube-state-metrics separately rather than bundling it as a Helm dependency; for that we’re on 1.2.0.

After the upgrade, most things are working OK, but kube-state-metrics data is no longer coming through, even though none of the rest of the system has changed.

This is what we see in the logs:

datadog-master-rcd8v datadog [ AGENT ] 2018-04-05 04:33:04 UTC | ERROR | (runner.go:276 in work) | Error running check kubernetes_state: [{"message": "HTTPConnectionPool(host='10.2.13.182', port=8080): Read timed out. (read timeout=1)", "traceback": "Traceback (most recent call last):\n  File \"/opt/datadog-agent/bin/agent/dist/checks/__init__.py\", line 332, in run\n    self.check(copy.deepcopy(self.instances[0]))\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/kubernetes_state/kubernetes_state.py\", line 196, in check\n    self.process(endpoint, send_histograms_buckets=send_buckets, instance=instance)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/checks/prometheus/mixins.py\", line 350, in process\n    for metric in self.scrape_metrics(endpoint):\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/checks/prometheus/mixins.py\", line 314, in scrape_metrics\n    response = self.poll(endpoint)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/checks/prometheus/mixins.py\", line 467, in poll\n    response = requests.get(endpoint, headers=headers, stream=True, timeout=1, cert=cert, verify=verify)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/api.py\", line 72, in get\n    return request('get', url, params=params, **kwargs)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/api.py\", line 58, in request\n    return session.request(method=method, url=url, **kwargs)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/sessions.py\", line 508, in request\n    resp = self.send(prep, **send_kwargs)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/sessions.py\", line 618, in send\n    r = adapter.send(request, **kwargs)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/adapters.py\", line 521, in send\n    raise ReadTimeout(e, request=request)\nReadTimeout: HTTPConnectionPool(host='10.2.13.182', port=8080): Read timed out. (read timeout=1)\n"}]

I’ve confirmed 10.2.13.182 is the correct pod IP address; as mentioned, none of that side of things has changed, so Autodiscovery (AD) still seems to be working.

I went to the Datadog pod in question and did a little investigation into the timing:

time curl -v --silent --output /dev/null --show-error --fail 10.2.13.182:8080/metrics

The curl responds fine with content, but I’ve never gotten it to complete in under 1 second. However, that 1-second timeout appears to be hard-coded: https://github.com/DataDog/integrations-core/blob/69e3a575ff5d9dc62703b9b9f7789d98e91a2ec5/datadog-checks-base/datadog_checks/checks/prometheus/mixins.py#L472
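
For what it’s worth, the same call the check makes (per the traceback above) can be reproduced outside the agent with a few lines of Python; with the 1-second timeout it fails in just the same way. This is only a rough sketch; the endpoint is our cluster’s pod IP, so adjust it for yours:

# Reproduces the check's scrape call from the traceback above:
# requests.get(endpoint, stream=True, timeout=1)
import requests

endpoint = "http://10.2.13.182:8080/metrics"  # pod IP from our cluster; adjust for yours

try:
    response = requests.get(endpoint, stream=True, timeout=1)
    response.raise_for_status()
    print("scrape ok: HTTP %d" % response.status_code)
except requests.exceptions.RequestException as e:
    # On our cluster this raises the same ReadTimeout the agent logs.
    print("scrape failed: %s" % e)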

The payload is ~11214 bytes on that cluster. I tried increasing the resources for the kube-state-metrics pod but wasn’t able to get it to respond any faster; the stats for the kube-state-metrics container show it isn’t using anywhere close to its allocated CPU/memory.
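
If it helps anyone compare clusters, here is a similar rough sketch that measures payload size and scrape latency together, using the same requests library the check uses (no timeout, so the scrape runs to completion):

import time
import requests

endpoint = "http://10.2.13.182:8080/metrics"  # pod IP from our cluster; adjust for yours

start = time.time()
response = requests.get(endpoint)  # no timeout, so the scrape can run to completion
elapsed = time.time() - start

print("HTTP %d: %d bytes in %.2fs" % (response.status_code, len(response.content), elapsed))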

I’m not sure how best to deal with this. As a start, the timeout could be made configurable, but perhaps there are broader plans to handle this in a better way.
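
To illustrate what “configurable” could look like, here is a rough sketch of the poll call reading its timeout from the check’s instance config instead of hard-coding 1 second. The option name prometheus_timeout and the function shape below are only placeholders for the idea, not the actual integrations-core code:

import requests

# Sketch only: the option name "prometheus_timeout" and this function shape
# are placeholders, not the actual integrations-core implementation.
def poll(endpoint, instance, headers=None, cert=None, verify=None):
    # Fall back to the current 1-second behaviour when nothing is configured.
    timeout = int(instance.get("prometheus_timeout", 1))
    return requests.get(endpoint, headers=headers, stream=True,
                        timeout=timeout, cert=cert, verify=verify)

The matching check config would then only need one extra key per instance.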

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Reactions: 1
  • Comments: 55 (24 by maintainers)

Top GitHub Comments

3 reactions
cknowles commented, Apr 25, 2018

We had some 5.23.0 issues too on different nodes; reverting to 5.22.3 (Docker tag 12.6.5223) resolved that.

1 reaction
cainelli commented, Apr 25, 2018

I was having the same issue running 12.6.5230; downgrading worked for me, thanks @c-knowles. Is there a plan to roll out the fix on 5.x?
