
Certain metrics unavailable after a connection timeout


What happened:

After a timeout connecting to Splunk, certain metrics are no longer sent to Splunk, even though the server is available again.

A log dump from the running pod (the Splunk metrics aggregator) shows that this happened on 12 July:

2019-07-12 09:45:21 +0000 [error]: #0 Unexpected error raised. Stopping the timer. title=:resource_usage_scraper error_class=RestClient::Exceptions::OpenTimeout error="Timed out connecting to server"
  2019-07-12 09:45:21 +0000 [error]: #0 /usr/lib/ruby/gems/2.5.0/gems/rest-client-2.0.2/lib/restclient/request.rb:731:in `rescue in transmit'
  2019-07-12 09:45:21 +0000 [error]: #0 /usr/lib/ruby/gems/2.5.0/gems/rest-client-2.0.2/lib/restclient/request.rb:642:in `transmit'
  2019-07-12 09:45:21 +0000 [error]: #0 /usr/lib/ruby/gems/2.5.0/gems/rest-client-2.0.2/lib/restclient/request.rb:145:in `execute'
  2019-07-12 09:45:21 +0000 [error]: #0 /usr/lib/ruby/gems/2.5.0/gems/rest-client-2.0.2/lib/restclient/request.rb:52:in `execute'
  2019-07-12 09:45:21 +0000 [error]: #0 /usr/lib/ruby/gems/2.5.0/gems/rest-client-2.0.2/lib/restclient/resource.rb:51:in `get'
  2019-07-12 09:45:21 +0000 [error]: #0 /usr/lib/ruby/gems/2.5.0/gems/fluent-plugin-k8s-metrics-agg-1.1.0/lib/fluent/plugin/in_kubernetes_metrics_aggregator.rb:536:in `scrape_resource_usage_metrics'
  2019-07-12 09:45:21 +0000 [error]: #0 /usr/lib/ruby/gems/2.5.0/gems/fluentd-1.4.0/lib/fluent/plugin_helper/timer.rb:80:in `on_timer'
  2019-07-12 09:45:21 +0000 [error]: #0 /usr/lib/ruby/gems/2.5.0/gems/cool.io-1.5.3/lib/cool.io/loop.rb:88:in `run_once'
  2019-07-12 09:45:21 +0000 [error]: #0 /usr/lib/ruby/gems/2.5.0/gems/cool.io-1.5.3/lib/cool.io/loop.rb:88:in `run'
  2019-07-12 09:45:21 +0000 [error]: #0 /usr/lib/ruby/gems/2.5.0/gems/fluentd-1.4.0/lib/fluent/plugin_helper/event_loop.rb:93:in `block in start'
  2019-07-12 09:45:21 +0000 [error]: #0 /usr/lib/ruby/gems/2.5.0/gems/fluentd-1.4.0/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
2019-07-12 09:45:21 +0000 [error]: #0 Timer detached. title=:resource_usage_scraper

Some of the metrics that are not available are:

  • kube.node.memory.allocatable
  • kube.cluster.cpu.request
  • kube.cluster.cpu.limit

What you expected to happen:

Metrics should be sent to Splunk once the host is available again. Right now the pod is in a Ready state, but the process running inside it is no longer collecting metrics and forwarding them to Splunk.
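
The trace above shows the scrape running inside fluentd's timer plugin helper, which logs "Unexpected error raised. Stopping the timer." and detaches the timer as soon as the block raises, so a single timeout permanently ends that scrape loop. A minimal sketch of one possible direction (not the plugin's actual fix) is to rescue transient timeouts inside the timer block so the scrape simply retries on the next interval. The class name, plugin name, and 15-second interval below are hypothetical; scrape_resource_usage_metrics is the method named in the trace.

  require 'fluent/plugin/input'
  require 'rest-client'

  module Fluent::Plugin
    # Hypothetical sketch, not the shipped plugin code.
    class KubernetesMetricsAggSketch < Input
      Fluent::Plugin.register_input('k8s_metrics_agg_sketch', self)
      helpers :timer

      def start
        super
        # timer_execute detaches the timer if the block raises (the
        # "Stopping the timer" / "Timer detached" lines in the log above),
        # so keep the rescue inside the block and let the next tick retry.
        timer_execute(:resource_usage_scraper, 15) do
          begin
            scrape_resource_usage_metrics
          rescue RestClient::Exceptions::OpenTimeout, RestClient::Exceptions::ReadTimeout => e
            # Transient timeout: log it and keep the timer attached.
            log.warn 'resource usage scrape timed out, retrying on next interval', error: e.message
          end
        end
      end

      def scrape_resource_usage_metrics
        # Placeholder for the real scrape that the trace shows raising
        # RestClient::Exceptions::OpenTimeout.
      end
    end
  end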

How to reproduce it (as minimally and precisely as possible):

Block the pod from communicating with the host so that the connection times out.
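
A quick way to trigger the same exception class outside the cluster (this only reproduces the RestClient::Exceptions::OpenTimeout the scraper hits, not the detached-timer behaviour itself) is a short rest-client script against an unreachable address; the address, port, and path below are purely illustrative.

  require 'rest-client'

  # 10.255.255.1 is an arbitrary non-routable address used to force a connect
  # timeout; 10250/stats/summary mimics a kubelet-style endpoint.
  resource = RestClient::Resource.new('https://10.255.255.1:10250/stats/summary',
                                      open_timeout: 2, read_timeout: 2,
                                      verify_ssl: false)
  begin
    resource.get
  rescue RestClient::Exceptions::OpenTimeout => e
    puts e.message  # => "Timed out connecting to server", as in the log above
  end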

Environment:

  • Splunk metrics aggregator image: splunk/k8s-metrics-aggr:1.1.0
  • Kubernetes version (use kubectl version): v1.13.5

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Reactions: 2
  • Comments: 13 (6 by maintainers)

Top GitHub Comments

1 reaction
chaitanyaphalak commented, Sep 24, 2019

We will consider it for a future release of the metrics-agg plugin.

1 reaction
cwebbtw commented, Aug 14, 2019

Hey @chaitanyaphalak, I’ll see what I can come up with in a PR, thanks!
