
Log loss observed at 15 MB/s @ 3,000 events/second


Team,

I notice that at around 5,000 bytes per message and 15 MB/second, the fluentd Splunk Connect forwarder fails to forward nearly two-thirds of the generated logs.
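
(For what it's worth, those numbers are internally consistent: 3,000 events/second × 5,000 bytes/message ≈ 15 MB/second of raw log payload, matching the rate in the title.)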

We don’t see any error logs generated from the DaemonSet, and no 5xx responses from the Splunk HEC endpoint. The Splunk HEC load balancer does not indicate any latency problem, and logs that do make it to Splunk have a TP99 latency of under 10 seconds. At the source:

  • Buffer flush retry_count: less than 5
  • Buffer size: less than 40 MB (600 MB provisioned)
  • Queue size: less than 2
  • CPU: less than 70%

Observations:

  • The JQ plugin tends to utilize 30% of a single CPU core at peak load.
  • Overall CPU across all cores is less than 70%.

Configuration delta from default:

        <buffer>
          @type memory
          chunk_limit_records 100000
          chunk_limit_size 200m
          flush_interval 5s
          flush_thread_count 5
          overflow_action throw_exception
          retry_max_times 3
          total_limit_size 600m
        </buffer>

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 14 (4 by maintainers)

Top GitHub Comments

2 reactions
matthewmodestino commented, Oct 3, 2019

The most common cause of missing logs I see in environments is log rotation settings that are too small for high-velocity logging. Please ensure Docker or logrotate is set to a sane production/high-volume size: the default of 10 MB is not enough, and something in the 1 GB to 10 GB range is probably more reasonable.
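
For illustration, a minimal sketch of raising Docker's json-file rotation limits in /etc/docker/daemon.json (the sizes here are placeholders, not from the original thread):

        {
          "log-driver": "json-file",
          "log-opts": {
            "max-size": "1g",
            "max-file": "5"
          }
        }

On clusters where the kubelet rotates container logs instead of Docker, the equivalent knobs are containerLogMaxSize and containerLogMaxFiles in the kubelet configuration.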

The other common cause is filling the buffer, though you would see that in the pod logs. Our customers who have stress tested it generally opt to use a fluentd file buffer to ensure they don’t lose data if the network is down or Splunk is overloaded.

This is just a sample; exact configs will vary.

        <buffer>
          @type file
          path /var/log/splunk-fluentd/
          chunk_limit_size 8MB
          total_limit_size 1GB
          flush_interval 5s
          flush_thread_count 1
          overflow_action block
          retry_forever true
        </buffer>

https://docs.fluentd.org/buffer/file
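
To confirm whether the buffer is actually filling, one option (a minimal illustrative config using the conventional port, not part of the original reply) is fluentd's built-in monitor_agent input, which exposes per-plugin buffer and retry metrics over HTTP:

        <source>
          @type monitor_agent
          bind 0.0.0.0
          port 24220
        </source>

Then curl http://localhost:24220/api/plugins.json and watch buffer_total_queued_size and retry_count on the Splunk HEC output plugin.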

Let me know if these get you sorted. If not, open a support ticket and we can gather info about your environment and assist!

1 reaction
bhattchaitanya commented, Oct 8, 2019

We did see significant improvement by increasing the log rotation file size limits in the Kubernetes cluster.spec file. Thanks for the hint about the enable_watch_time flag! We will play around with that. Can you elaborate on the “inputs that account for rolled logs”? What does this actually mean?
