
Fluentd tuning for logging pods on big nodes.

See original GitHub issue

What happened:
We have deployed this to all of our OpenShift 4 bare-metal clusters. The clusters with the most activity and the highest pod counts per node are not able to send all container logs to Splunk, and/or the logs sit in a long (roughly 30 minute) queue before becoming visible in the Splunk indexers.

Some logs from a splunk-kubernetes-logging pod, with info-level logs filtered out:

2020-11-06 14:03:32 +0000 [warn]: #0 buffer flush took longer time than slow_flush_log_threshold: elapsed_time=92.02205770404544 slow_flush_log_threshold=20.0 plugin_id="object:2ade063e1c3c"
2020-11-06 14:04:40 +0000 [warn]: #0 buffer flush took longer time than slow_flush_log_threshold: elapsed_time=67.76551360095618 slow_flush_log_threshold=20.0 plugin_id="object:2ade063e1c3c"
2020-11-06 14:05:36 +0000 [warn]: #0 buffer flush took longer time than slow_flush_log_threshold: elapsed_time=56.591755325032864 slow_flush_log_threshold=20.0 plugin_id="object:2ade063e1c3c"
2020-11-06 14:06:35 +0000 [warn]: #0 buffer flush took longer time than slow_flush_log_threshold: elapsed_time=58.183909379993565 slow_flush_log_threshold=20.0 plugin_id="object:2ade063e1c3c"
2020-11-06 14:07:42 +0000 [warn]: #0 buffer flush took longer time than slow_flush_log_threshold: elapsed_time=67.70138586801477 slow_flush_log_threshold=20.0 plugin_id="object:2ade063e1c3c"
2020-11-06 14:08:38 +0000 [warn]: #0 buffer flush took longer time than slow_flush_log_threshold: elapsed_time=55.92357094195904 slow_flush_log_threshold=20.0 plugin_id="object:2ade063e1c3c"
2020-11-06 14:09:47 +0000 [warn]: #0 buffer flush took longer time than slow_flush_log_threshold: elapsed_time=68.77938993298449 slow_flush_log_threshold=20.0 plugin_id="object:2ade063e1c3c"
2020-11-06 14:10:42 +0000 [warn]: #0 buffer flush took longer time than slow_flush_log_threshold: elapsed_time=54.5333183449693 slow_flush_log_threshold=20.0 plugin_id="object:2ade063e1c3c"
2020-11-06 14:12:06 +0000 [warn]: #0 buffer flush took longer time than slow_flush_log_threshold: elapsed_time=83.83802559098694 slow_flush_log_threshold=20.0 plugin_id="object:2ade063e1c3c"
2020-11-06 14:13:16 +0000 [warn]: #0 buffer flush took longer time than slow_flush_log_threshold: elapsed_time=70.05958420498064 slow_flush_log_threshold=20.0 plugin_id="object:2ade063e1c3c"

What you expected to happen: We expect the splunk-kubernetes-logging pods to be able to send all of the container logs from the host to the Splunk HEC in real time.

How to reproduce it (as minimally and precisely as possible): Use the default settings in the 1.4.3 branch of this repo, and run over 100 pods/node.

Anything else we need to know?: Our application nodes have 384 GB of RAM. Our average pod uses 1.5-2 GB of RAM. RH CoreOS uses cri-o, not Docker, so the log file locations are different. We have tried changing the buffer type to "file" and setting the retry_forever=true parameter (a sketch of that override is shown below). This results in fewer dropped logs, but a delay of at least 30 minutes before they show up in Splunk.
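
Roughly, the override looks like the sketch below. This is a minimal sketch only: the splunk-kubernetes-logging / buffer layout in values.yaml and the buffer path are assumptions about how the 1.4.3 chart exposes these Fluentd buffer parameters, not a verified excerpt of our configuration.

```yaml
# Minimal sketch of a file-buffer override applied through the chart's values.yaml.
# Keys under "buffer" map to Fluentd buffer parameters; the surrounding structure
# is an assumption based on the splunk-kubernetes-logging subchart.
splunk-kubernetes-logging:
  buffer:
    "@type": file                   # switch from the default in-memory buffer
    path: /var/log/splunkd-buffer   # hypothetical on-node path for buffer chunks
    total_limit_size: 600m
    chunk_limit_size: 20m
    flush_interval: 5s
    overflow_action: block
    retry_forever: true             # keep retrying failed flushes instead of dropping chunks
```

With retry_forever enabled, Fluentd never discards chunks after repeated flush failures, which matches the behaviour we see: fewer lost logs, but a backlog that takes a long time to drain.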

Environment:

  • Kubernetes version (use kubectl version):
    $ kubectl version
    Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.2-0-g52c56ce", GitCommit:"d7f3ccf9a5bdc96ba92e31526cf014b3de4c46aa", GitTreeState:"clean", BuildDate:"2020-09-16T15:25:59Z", GoVersion:"go1.13.4", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.3+47c0e71", GitCommit:"47c0e71", GitTreeState:"clean", BuildDate:"2020-09-17T23:10:07Z", GoVersion:"go1.13.4", Compiler:"gc", Platform:"linux/amd64"}

  • OS (e.g: cat /etc/os-release):
    $ cat /etc/os-release
    NAME="Red Hat Enterprise Linux CoreOS"
    VERSION="45.82.202009181447-0"
    VERSION_ID="4.5"
    OPENSHIFT_VERSION="4.5"
    RHEL_VERSION="8.2"
    PRETTY_NAME="Red Hat Enterprise Linux CoreOS 45.82.202009181447-0 (Ootpa)"
    ID="rhcos"
    ID_LIKE="rhel fedora"
    ANSI_COLOR="0;31"
    CPE_NAME="cpe:/o:redhat:enterprise_linux:8::coreos"
    HOME_URL="https://www.redhat.com/"
    BUG_REPORT_URL="https://bugzilla.redhat.com/"
    REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
    REDHAT_BUGZILLA_PRODUCT_VERSION="4.5"
    REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
    REDHAT_SUPPORT_PRODUCT_VERSION="4.5"
    OSTREE_VERSION='45.82.202009181447-0'

  • Splunk version: 8.0.3

  • Splunk Connect for Kubernetes helm chart version: 1.4.3

  • Others:

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 20 (3 by maintainers)

Top GitHub Comments

2 reactions
matthewmodestino commented, Nov 16, 2020

@MikeTomlin19 yes, the UF should work in OpenShift under cri-o. I run it under containerd in MicroK8s without issue. Here is a sample test:

https://mattymo.io/code/mattymo/ta-k8s-logging/-/snippets/2

Basic sample config here: https://mattymo.io/code/mattymo/ta-k8s-logging (still a work in progress as we explore multiline handling and metadata enrichment with ingest-time lookups and evals).

No Helm charts; this is likely destined for the Splunk Operator (https://github.com/splunk/splunk-operator). Would you mind opening an issue there requesting a UF node agent and/or sidecar?

1 reaction
vinzent commented, Feb 7, 2021

@MikeTomlin19 can you share your values.yml?

Did you configure flush_thread_count in your buffer settings?

How is the CPU usage of Fluentd while you are hitting the slow_flush_log_threshold warnings?
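
For context, flush_thread_count is a standard Fluentd buffer parameter that controls how many threads flush buffer chunks to the output in parallel (the default is 1). Below is a minimal, illustrative sketch of raising it through the chart's buffer block, assuming the same values.yaml structure as in the sketch above; the value shown is an example, not a recommendation.

```yaml
# Illustrative sketch only: allow several buffer chunks to be flushed to the
# Splunk HEC in parallel instead of serially with a single flush thread.
splunk-kubernetes-logging:
  buffer:
    "@type": file
    flush_interval: 5s
    flush_thread_count: 4   # example value; default is 1, tune against available CPU
    chunk_limit_size: 20m
```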
