Multiline events not flushing until next event occurs
I have the following configuration:
```yaml
logs:
  my-app:
    from:
      pod: my-app
      container: my-app
    multiline:
      firstline: /^\d{4}-\d\d-\d\dT\d\d:\d\d:\d\d,\d{3}[+-]\d\d \[[a-zA-Z]+\] \[/
    sourcetype: app:log4j2
```
I am getting multi-line events as I expect. The problem is that each event is buffered and isn't delivered until the next event starts. The timestamp in Splunk is the timestamp of when the next event starts buffering. For example, event-a is written to the log and buffers somewhere in fluentd. Say 30 seconds pass and then event-b comes into the log; at that point event-a is sent to Splunk with a timestamp that is 30 seconds late.
At first I thought something was wrong with the flush_interval for the concat plugin. In splunk-kubernetes-logging/templates/configMap.yaml, line 160 reads flush_interval {{ $logDef.multiline.flushInterval | default "5s" }}, and I thought the value was supposed to be 5 instead of 5s (see the fluentd concat documentation). However, that change made no difference. I can also see that the timeout flush itself does fire. From the fluentd log:
2019-09-20 18:21:22 +0000 [info]: #0 Timeout flush: tail.containers.var.log.containers.my-app-655555b857-jlknz_default_my-app-80fb6e9b4517fbb754758b2e821464384bd30a5e2ce4f538cd050ef4c3e1c281.log:stdout
So I can see fluentd saying the concat flush occurred, but the event does not get sent.
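For context, here is a minimal sketch of what the rendered concat filter looks like for this log. The tag pattern and record key are illustrative assumptions, not copied verbatim from configMap.yaml; the firstline regexp, the 5s default, and the @SPLUNK timeout_label are the values discussed in this issue:

```
# Minimal sketch of the rendered concat filter; the tag pattern and key
# name are assumptions, not the exact contents of configMap.yaml.
<filter tail.containers.**>
  @type concat
  key log
  # A new event starts when a line matches the firstline pattern, e.g.
  # "2019-09-20T18:21:22,123+00 [INFO] ["
  multiline_start_regexp /^\d{4}-\d\d-\d\dT\d\d:\d\d:\d\d,\d{3}[+-]\d\d \[[a-zA-Z]+\] \[/
  flush_interval 5s
  # Where timed-out (flushed) events are emitted. In the chart this points
  # at the label section that is already executing, which is the bug
  # described below.
  timeout_label @SPLUNK
</filter>
```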
I found the root cause: the timeout path and the normal path need to target a common, separate label section for shared processing. See the fluentd concat plugin documentation under "Handle timeout log lines the same as normal logs".
In the output.conf section of splunk-kubernetes-logging/templates/configMap.yaml, timeout_label is set to @SPLUNK, which is the label section that is already executing. I created a new label section named @HEC and made it the target of both the timeout_label processing and the normal log processing (by adding a relabel). Once I did this, the concat timeout processing sent the event to the @HEC label and it was processed at the correct time.
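A minimal sketch of the resulting layout follows; the match patterns and the exact contents of each section are illustrative assumptions, not the chart's actual output.conf:

```
<label @SPLUNK>
  <filter tail.containers.**>
    @type concat
    key log
    multiline_start_regexp /^\d{4}-\d\d-\d\dT\d\d:\d\d:\d\d,\d{3}[+-]\d\d \[[a-zA-Z]+\] \[/
    flush_interval 5s
    # Timed-out events are re-emitted into the @HEC label below instead of
    # back into @SPLUNK, which is already executing.
    timeout_label @HEC
  </filter>

  # Normal flow: relabel records that passed the concat filter into @HEC too.
  <match **>
    @type relabel
    @label @HEC
  </match>
</label>

# Common processing and output for both the normal and the timeout path.
<label @HEC>
  <match **>
    @type splunk_hec
    # hec_host, hec_token, sourcetype, etc. omitted
  </match>
</label>
```

With this layout, an event flushed by the concat timeout follows the same route to the HEC output as a normally completed event, so it is sent immediately with its own timestamp.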
This fix has been merged in https://github.com/splunk/splunk-connect-for-kubernetes/pull/369 and released as version 1.4.1. Please reopen if it is still not resolved. Thank you!