Fluentd tuning for logging pods on big nodes.
What happened:
We have deployed this to all of our OpenShift 4 bare-metal clusters. On the clusters with the most activity and the most pods per node, the splunk-kubernetes-logging pods are not able to send all container logs to Splunk, and/or the logs sit in a long queue (30+ minutes) before becoming visible in the Splunk indexers.
Some logs from a splunk-kubernetes-logging pod, with info-level logs filtered out:
2020-11-06 14:03:32 +0000 [warn]: #0 buffer flush took longer time than slow_flush_log_threshold: elapsed_time=92.02205770404544 slow_flush_log_threshold=20.0 plugin_id="object:2ade063e1c3c"
2020-11-06 14:04:40 +0000 [warn]: #0 buffer flush took longer time than slow_flush_log_threshold: elapsed_time=67.76551360095618 slow_flush_log_threshold=20.0 plugin_id="object:2ade063e1c3c"
2020-11-06 14:05:36 +0000 [warn]: #0 buffer flush took longer time than slow_flush_log_threshold: elapsed_time=56.591755325032864 slow_flush_log_threshold=20.0 plugin_id="object:2ade063e1c3c"
2020-11-06 14:06:35 +0000 [warn]: #0 buffer flush took longer time than slow_flush_log_threshold: elapsed_time=58.183909379993565 slow_flush_log_threshold=20.0 plugin_id="object:2ade063e1c3c"
2020-11-06 14:07:42 +0000 [warn]: #0 buffer flush took longer time than slow_flush_log_threshold: elapsed_time=67.70138586801477 slow_flush_log_threshold=20.0 plugin_id="object:2ade063e1c3c"
2020-11-06 14:08:38 +0000 [warn]: #0 buffer flush took longer time than slow_flush_log_threshold: elapsed_time=55.92357094195904 slow_flush_log_threshold=20.0 plugin_id="object:2ade063e1c3c"
2020-11-06 14:09:47 +0000 [warn]: #0 buffer flush took longer time than slow_flush_log_threshold: elapsed_time=68.77938993298449 slow_flush_log_threshold=20.0 plugin_id="object:2ade063e1c3c"
2020-11-06 14:10:42 +0000 [warn]: #0 buffer flush took longer time than slow_flush_log_threshold: elapsed_time=54.5333183449693 slow_flush_log_threshold=20.0 plugin_id="object:2ade063e1c3c"
2020-11-06 14:12:06 +0000 [warn]: #0 buffer flush took longer time than slow_flush_log_threshold: elapsed_time=83.83802559098694 slow_flush_log_threshold=20.0 plugin_id="object:2ade063e1c3c"
2020-11-06 14:13:16 +0000 [warn]: #0 buffer flush took longer time than slow_flush_log_threshold: elapsed_time=70.05958420498064 slow_flush_log_threshold=20.0 plugin_id="object:2ade063e1c3c"
What you expected to happen: We expect the splunk-kubernetes-logging pods to be able to send all of the container logs from the host to the splunk HEC in real time.
How to reproduce it (as minimally and precisely as possible): Use the default settings in the 1.4.3 branch of this repo, and run over 100 pods/node.
Anything else we need to know?: Our application nodes have 384 GB of RAM, and our average pod uses 1.5-2 GB of RAM. RH CoreOS uses cri-o rather than docker, so the container log file locations are different. We've tried changing the buffer type to "file" and setting retry_forever=true. This results in fewer dropped logs, but a delay of at least 30 minutes before they show up in Splunk.
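For reference, here is a rough sketch of where these buffer settings live in the splunk-kubernetes-logging section of our values.yaml. The key names are from memory and may vary between chart versions, so treat it as an illustration rather than our exact file:

```yaml
# Rough sketch of the buffer section being tuned in the
# splunk-kubernetes-logging values.yaml. Key names are assumptions and
# may differ by chart version; check the chart's default values.yaml.
buffer:
  "@type": file                 # switched from the default in-memory buffer
  path: /var/fluent/buffer      # assumed path; must be writable by the fluentd pod
  total_limit_size: 600m
  chunk_limit_size: 20m
  flush_interval: 5s
  flush_thread_count: 1
  overflow_action: block
  retry_forever: true           # fewer dropped logs, but chunks back up when HEC flushes are slow
```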
Environment:
- Kubernetes version (use `kubectl version`):
  $ kubectl version
  Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.2-0-g52c56ce", GitCommit:"d7f3ccf9a5bdc96ba92e31526cf014b3de4c46aa", GitTreeState:"clean", BuildDate:"2020-09-16T15:25:59Z", GoVersion:"go1.13.4", Compiler:"gc", Platform:"linux/amd64"}
  Server Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.3+47c0e71", GitCommit:"47c0e71", GitTreeState:"clean", BuildDate:"2020-09-17T23:10:07Z", GoVersion:"go1.13.4", Compiler:"gc", Platform:"linux/amd64"}
- OS (e.g. `cat /etc/os-release`):
  NAME="Red Hat Enterprise Linux CoreOS"
  VERSION="45.82.202009181447-0"
  VERSION_ID="4.5"
  OPENSHIFT_VERSION="4.5"
  RHEL_VERSION="8.2"
  PRETTY_NAME="Red Hat Enterprise Linux CoreOS 45.82.202009181447-0 (Ootpa)"
  ID="rhcos"
  ID_LIKE="rhel fedora"
  ANSI_COLOR="0;31"
  CPE_NAME="cpe:/o:redhat:enterprise_linux:8::coreos"
  HOME_URL="https://www.redhat.com/"
  BUG_REPORT_URL="https://bugzilla.redhat.com/"
  REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
  REDHAT_BUGZILLA_PRODUCT_VERSION="4.5"
  REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
  REDHAT_SUPPORT_PRODUCT_VERSION="4.5"
  OSTREE_VERSION='45.82.202009181447-0'
- Splunk version: 8.0.3
- Splunk Connect for Kubernetes helm chart version: 1.4.3
- Others:
@MikeTomlin19 yes, the UF should work in OpenShift under cri-o. I run it under containerd in MicroK8s without issue. Here is a sample test:
https://mattymo.io/code/mattymo/ta-k8s-logging/-/snippets/2
Basic sample config here: https://mattymo.io/code/mattymo/ta-k8s-logging - still a work in progress as we explore multiline handling and metadata enrichment with ingest-time lookups and evals.
No Helm charts yet; this is likely destined for the Splunk Operator -> https://github.com/splunk/splunk-operator. Would you mind opening an issue there requesting a UF node agent and/or sidecar?
@MikeTomlin19 can you share your values.yml?
Did you configure `flush_thread_count` in buffers? And how is the CPU usage of fluentd while you are getting the `slow_flush_log_threshold` warnings?
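For example, something along these lines in values.yaml would raise the flush parallelism and give fluentd more CPU headroom. This is illustrative only; the key names and numbers are assumptions, so adjust them for your chart version and node size:

```yaml
# Illustrative values.yaml fragment (key names and numbers are assumptions,
# not the reporter's actual configuration).
buffer:
  flush_thread_count: 4     # flush chunks to HEC in parallel instead of a single thread
resources:
  requests:
    cpu: 200m
    memory: 300Mi
  limits:
    cpu: "1"                # headroom for fluentd; slow flushes can also indicate CPU throttling
    memory: 600Mi
```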