
kube-state-metrics timing out with agent v6.1.0-6.1.4 / v5.23.0

See original GitHub issue

We’ve just switched from agent v5, deployed with our own YAML, to v6.1.0 deployed via the Helm chart, and we appear to have hit the same problem as https://github.com/kubernetes/charts/issues/1466. We deploy kube-state-metrics separately rather than bundling it as a Helm dependency; for that we’re on 1.2.0.

After the upgrade, most things are working OK, but kube-state-metrics data is no longer coming through, even though none of the rest of the system has changed.

This is what we see in the logs:

datadog-master-rcd8v datadog [ AGENT ] 2018-04-05 04:33:04 UTC | ERROR | (runner.go:276 in work) | Error running check kubernetes_state: [{"message": "HTTPConnectionPool(host='10.2.13.182', port=8080): Read timed out. (read timeout=1)", "traceback": "Traceback (most recent call last):\n  File \"/opt/datadog-agent/bin/agent/dist/checks/__init__.py\", line 332, in run\n    self.check(copy.deepcopy(self.instances[0]))\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/kubernetes_state/kubernetes_state.py\", line 196, in check\n    self.process(endpoint, send_histograms_buckets=send_buckets, instance=instance)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/checks/prometheus/mixins.py\", line 350, in process\n    for metric in self.scrape_metrics(endpoint):\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/checks/prometheus/mixins.py\", line 314, in scrape_metrics\n    response = self.poll(endpoint)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/checks/prometheus/mixins.py\", line 467, in poll\n    response = requests.get(endpoint, headers=headers, stream=True, timeout=1, cert=cert, verify=verify)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/api.py\", line 72, in get\n    return request('get', url, params=params, **kwargs)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/api.py\", line 58, in request\n    return session.request(method=method, url=url, **kwargs)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/sessions.py\", line 508, in request\n    resp = self.send(prep, **send_kwargs)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/sessions.py\", line 618, in send\n    r = adapter.send(request, **kwargs)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/adapters.py\", line 521, in send\n    raise ReadTimeout(e, request=request)\nReadTimeout: HTTPConnectionPool(host='10.2.13.182', port=8080): Read timed out. (read timeout=1)\n"}]

I’ve confirmed 10.2.13.182 is the correct pod IP address; as mentioned, none of that side of things has changed, so Autodiscovery (AD) still seems to be working.

I went to the Datadog pod in question and did a little investigation into the timing:

time curl -v --silent --output /dev/null --show-error --fail 10.2.13.182:8080/metrics

The curl responds fine with content, but I’ve never gotten it to complete in under 1 second. However, that 1-second timeout appears to be hard-coded: https://github.com/DataDog/integrations-core/blob/69e3a575ff5d9dc62703b9b9f7789d98e91a2ec5/datadog-checks-base/datadog_checks/checks/prometheus/mixins.py#L472
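
For what it’s worth, the same call the check makes (per the traceback above) can be reproduced outside the agent with a few lines of Python; with the 1-second timeout it fails in just the same way. This is only a rough sketch; the endpoint is our cluster’s pod IP, so adjust it for yours:

# Reproduces the check's scrape call from the traceback above:
# requests.get(endpoint, stream=True, timeout=1)
import requests

endpoint = "http://10.2.13.182:8080/metrics"  # pod IP from our cluster; adjust for yours

try:
    response = requests.get(endpoint, stream=True, timeout=1)
    response.raise_for_status()
    print("scrape ok: HTTP %d" % response.status_code)
except requests.exceptions.RequestException as e:
    # On our cluster this raises the same ReadTimeout the agent logs.
    print("scrape failed: %s" % e)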

The payload is ~11214 bytes on that cluster. I tried increasing the resources for the kube-state-metrics pod but wasn’t able to get it to respond any faster; the stats for the kube-state-metrics container show it isn’t using anywhere close to its allocated CPU/memory.
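
If it helps anyone compare clusters, here is a similar rough sketch that measures payload size and scrape latency together, using the same requests library the check uses (no timeout, so the scrape runs to completion):

import time
import requests

endpoint = "http://10.2.13.182:8080/metrics"  # pod IP from our cluster; adjust for yours

start = time.time()
response = requests.get(endpoint)  # no timeout, so the scrape can run to completion
elapsed = time.time() - start

print("HTTP %d: %d bytes in %.2fs" % (response.status_code, len(response.content), elapsed))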

I’m not sure how best to deal with this. As a start, the timeout could be made configurable, but perhaps there are broader plans to handle this in a better way.
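
To illustrate what “configurable” could look like, here is a rough sketch of the poll call reading its timeout from the check’s instance config instead of hard-coding 1 second. The option name prometheus_timeout and the function shape below are only placeholders for the idea, not the actual integrations-core code:

import requests

# Sketch only: the option name "prometheus_timeout" and this function shape
# are placeholders, not the actual integrations-core implementation.
def poll(endpoint, instance, headers=None, cert=None, verify=None):
    # Fall back to the current 1-second behaviour when nothing is configured.
    timeout = int(instance.get("prometheus_timeout", 1))
    return requests.get(endpoint, headers=headers, stream=True,
                        timeout=timeout, cert=cert, verify=verify)

The matching check config would then only need one extra key per instance.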

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Reactions: 1
  • Comments: 55 (24 by maintainers)

Top GitHub Comments

3 reactions
cknowles commented, Apr 25, 2018

We had some 5.23.0 issues too on different nodes; reverting to 5.22.3 (Docker tag 12.6.5223) resolved that.

1 reaction
cainelli commented, Apr 25, 2018

I was having the same issue running 12.6.5230; downgrading worked for me, thanks @c-knowles. Is there a plan to roll out the fix on 5.x?
