
Certain metrics unavailable after a connection timeout


What happened:

After a timeout connecting to Splunk, certain metrics are no longer sent to Splunk, even though the server is available again.

A log dump from the running pod (the Splunk metrics aggregator) shows that this happened on 12 July:

2019-07-12 09:45:21 +0000 [error]: #0 Unexpected error raised. Stopping the timer. title=:resource_usage_scraper error_class=RestClient::Exceptions::OpenTimeout error="Timed out connecting to server"
  2019-07-12 09:45:21 +0000 [error]: #0 /usr/lib/ruby/gems/2.5.0/gems/rest-client-2.0.2/lib/restclient/request.rb:731:in `rescue in transmit'
  2019-07-12 09:45:21 +0000 [error]: #0 /usr/lib/ruby/gems/2.5.0/gems/rest-client-2.0.2/lib/restclient/request.rb:642:in `transmit'
  2019-07-12 09:45:21 +0000 [error]: #0 /usr/lib/ruby/gems/2.5.0/gems/rest-client-2.0.2/lib/restclient/request.rb:145:in `execute'
  2019-07-12 09:45:21 +0000 [error]: #0 /usr/lib/ruby/gems/2.5.0/gems/rest-client-2.0.2/lib/restclient/request.rb:52:in `execute'
  2019-07-12 09:45:21 +0000 [error]: #0 /usr/lib/ruby/gems/2.5.0/gems/rest-client-2.0.2/lib/restclient/resource.rb:51:in `get'
  2019-07-12 09:45:21 +0000 [error]: #0 /usr/lib/ruby/gems/2.5.0/gems/fluent-plugin-k8s-metrics-agg-1.1.0/lib/fluent/plugin/in_kubernetes_metrics_aggregator.rb:536:in `scrape_resource_usage_metrics'
  2019-07-12 09:45:21 +0000 [error]: #0 /usr/lib/ruby/gems/2.5.0/gems/fluentd-1.4.0/lib/fluent/plugin_helper/timer.rb:80:in `on_timer'
  2019-07-12 09:45:21 +0000 [error]: #0 /usr/lib/ruby/gems/2.5.0/gems/cool.io-1.5.3/lib/cool.io/loop.rb:88:in `run_once'
  2019-07-12 09:45:21 +0000 [error]: #0 /usr/lib/ruby/gems/2.5.0/gems/cool.io-1.5.3/lib/cool.io/loop.rb:88:in `run'
  2019-07-12 09:45:21 +0000 [error]: #0 /usr/lib/ruby/gems/2.5.0/gems/fluentd-1.4.0/lib/fluent/plugin_helper/event_loop.rb:93:in `block in start'
  2019-07-12 09:45:21 +0000 [error]: #0 /usr/lib/ruby/gems/2.5.0/gems/fluentd-1.4.0/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
2019-07-12 09:45:21 +0000 [error]: #0 Timer detached. title=:resource_usage_scraper

Some of the metrics that are not available are:

  • kube.node.memory.allocatable
  • kube.cluster.cpu.request
  • kube.cluster.cpu.limit

What you expected to happen:

Metrics should be sent to Splunk once the host is available again. Right now the pod is in a Ready state, but the process running inside it is no longer collecting metrics and forwarding them to Splunk.
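
The trace above shows the scrape running inside fluentd's timer plugin helper, which logs "Unexpected error raised. Stopping the timer." and detaches the timer as soon as the block raises, so a single timeout permanently ends that scrape loop. A minimal sketch of one possible direction (not the plugin's actual fix) is to rescue transient timeouts inside the timer block so the scrape simply retries on the next interval. The class name, plugin name, and 15-second interval below are hypothetical; scrape_resource_usage_metrics is the method named in the trace.

  require 'fluent/plugin/input'
  require 'rest-client'

  module Fluent::Plugin
    # Hypothetical sketch, not the shipped plugin code.
    class KubernetesMetricsAggSketch < Input
      Fluent::Plugin.register_input('k8s_metrics_agg_sketch', self)
      helpers :timer

      def start
        super
        # timer_execute detaches the timer if the block raises (the
        # "Stopping the timer" / "Timer detached" lines in the log above),
        # so keep the rescue inside the block and let the next tick retry.
        timer_execute(:resource_usage_scraper, 15) do
          begin
            scrape_resource_usage_metrics
          rescue RestClient::Exceptions::OpenTimeout, RestClient::Exceptions::ReadTimeout => e
            # Transient timeout: log it and keep the timer attached.
            log.warn 'resource usage scrape timed out, retrying on next interval', error: e.message
          end
        end
      end

      def scrape_resource_usage_metrics
        # Placeholder for the real scrape that the trace shows raising
        # RestClient::Exceptions::OpenTimeout.
      end
    end
  end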

How to reproduce it (as minimally and precisely as possible):

Block the pod from communicating with the host so that the connection times out.
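
A quick way to trigger the same exception class outside the cluster (this only reproduces the RestClient::Exceptions::OpenTimeout the scraper hits, not the detached-timer behaviour itself) is a short rest-client script against an unreachable address; the address, port, and path below are purely illustrative.

  require 'rest-client'

  # 10.255.255.1 is an arbitrary non-routable address used to force a connect
  # timeout; 10250/stats/summary mimics a kubelet-style endpoint.
  resource = RestClient::Resource.new('https://10.255.255.1:10250/stats/summary',
                                      open_timeout: 2, read_timeout: 2,
                                      verify_ssl: false)
  begin
    resource.get
  rescue RestClient::Exceptions::OpenTimeout => e
    puts e.message  # => "Timed out connecting to server", as in the log above
  end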

Environment:

  • Splunk metrics aggregator image: splunk/k8s-metrics-aggr:1.1.0
  • Kubernetes version (use kubectl version): v1.13.5

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Reactions: 2
  • Comments: 13 (6 by maintainers)

Top GitHub Comments

1 reaction
chaitanyaphalak commented, Sep 24, 2019

We will consider it for a future release of the metrics-agg plugin.

1 reaction
cwebbtw commented, Aug 14, 2019

Hey @chaitanyaphalak, I’ll see what I can come up with in a PR, thanks!
