Certain metrics unavailable after a connection timeout
What happened:
After a timeout while connecting to Splunk, certain metrics are no longer sent to Splunk even though the server is available again.
A log dump from the running pod (the Splunk metrics aggregator) indicates that this happened on July 12th:
2019-07-12 09:45:21 +0000 [error]: #0 Unexpected error raised. Stopping the timer. title=:resource_usage_scraper error_class=RestClient::Exceptions::OpenTimeout error="Timed out connecting to server"
2019-07-12 09:45:21 +0000 [error]: #0 /usr/lib/ruby/gems/2.5.0/gems/rest-client-2.0.2/lib/restclient/request.rb:731:in `rescue in transmit'
2019-07-12 09:45:21 +0000 [error]: #0 /usr/lib/ruby/gems/2.5.0/gems/rest-client-2.0.2/lib/restclient/request.rb:642:in `transmit'
2019-07-12 09:45:21 +0000 [error]: #0 /usr/lib/ruby/gems/2.5.0/gems/rest-client-2.0.2/lib/restclient/request.rb:145:in `execute'
2019-07-12 09:45:21 +0000 [error]: #0 /usr/lib/ruby/gems/2.5.0/gems/rest-client-2.0.2/lib/restclient/request.rb:52:in `execute'
2019-07-12 09:45:21 +0000 [error]: #0 /usr/lib/ruby/gems/2.5.0/gems/rest-client-2.0.2/lib/restclient/resource.rb:51:in `get'
2019-07-12 09:45:21 +0000 [error]: #0 /usr/lib/ruby/gems/2.5.0/gems/fluent-plugin-k8s-metrics-agg-1.1.0/lib/fluent/plugin/in_kubernetes_metrics_aggregator.rb:536:in `scrape_resource_usage_metrics'
2019-07-12 09:45:21 +0000 [error]: #0 /usr/lib/ruby/gems/2.5.0/gems/fluentd-1.4.0/lib/fluent/plugin_helper/timer.rb:80:in `on_timer'
2019-07-12 09:45:21 +0000 [error]: #0 /usr/lib/ruby/gems/2.5.0/gems/cool.io-1.5.3/lib/cool.io/loop.rb:88:in `run_once'
2019-07-12 09:45:21 +0000 [error]: #0 /usr/lib/ruby/gems/2.5.0/gems/cool.io-1.5.3/lib/cool.io/loop.rb:88:in `run'
2019-07-12 09:45:21 +0000 [error]: #0 /usr/lib/ruby/gems/2.5.0/gems/fluentd-1.4.0/lib/fluent/plugin_helper/event_loop.rb:93:in `block in start'
2019-07-12 09:45:21 +0000 [error]: #0 /usr/lib/ruby/gems/2.5.0/gems/fluentd-1.4.0/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
2019-07-12 09:45:21 +0000 [error]: #0 Timer detached. title=:resource_usage_scraper
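The trace shows why collection never recovers: fluentd's timer helper treats the unrescued RestClient::Exceptions::OpenTimeout as an unexpected error, stops the :resource_usage_scraper timer and detaches it, so scrape_resource_usage_metrics is never called again even after connectivity is restored. Below is a minimal Ruby sketch of the kind of guard that would keep the timer alive; the endpoint and method name are illustrative placeholders, not the plugin's actual code:

require 'rest-client'

# Sketch: rescue transient connection errors inside the timer callback so the
# exception never reaches fluentd's timer helper (which would otherwise stop
# and detach the timer, as in the log above). The next tick then simply retries.
def scrape_once(metrics_endpoint)
  RestClient::Resource.new(metrics_endpoint, open_timeout: 10, read_timeout: 10).get
rescue RestClient::Exceptions::OpenTimeout, RestClient::Exceptions::ReadTimeout => e
  warn "resource_usage_scraper: request timed out, will retry on the next interval: #{e.message}"
  nil
end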
Some of the metrics that are not available are:
- kube.node.memory.allocatable
- kube.cluster.cpu.request
- kube.cluster.cpu.limit
What you expected to happen:
Metrics should resume being sent to Splunk once the host is available again. Right now the pod is in a Ready state, but the process running inside it is no longer collecting metrics and forwarding them to Splunk.
How to reproduce it (as minimally and precisely as possible):
Block the pod from communicating with the host so that the connection attempt times out.
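One way to trigger the same OpenTimeout without touching the cluster's network is to point a rest-client call at a non-routable address, which is roughly what the plugin ends up doing once the pod's egress is blocked (the address and port below are assumptions for illustration only):

require 'rest-client'

# 10.255.255.1 is assumed to be non-routable from the pod, so the TCP connect
# hangs until open_timeout fires and RestClient raises OpenTimeout, matching
# the error_class shown in the log above.
begin
  RestClient::Resource.new('https://10.255.255.1:10250/stats/summary', open_timeout: 5).get
rescue RestClient::Exceptions::OpenTimeout => e
  puts "Reproduced the timeout: #{e.message}"
end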
Environment:
- Splunk metrics aggregator image: splunk/k8s-metrics-aggr:1.1.0
- Kubernetes version (use kubectl version): v1.13.5
We will consider it for a future release of the metrics-agg plugin.
Hey @chaitanyaphalak, I’ll see what I can come up with in a PR, thanks!