
Log loss observed at 15 MB/s @ 3,000 events/second


Team,

I notice that at around 5,000 bytes per message and 15 MB/second, the fluentd Splunk Connect forwarder fails to forward nearly two-thirds of the generated logs.
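
(For what it's worth, those numbers are internally consistent: 3,000 events/second × 5,000 bytes/message ≈ 15 MB/second of raw log payload, matching the rate in the title.)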

We don’t see any error logs generated from the DaemonSet, and no 5xx responses from the Splunk HEC endpoint. The Splunk HEC load balancer does not indicate any latency problem, and logs that do make it to Splunk have a TP99 latency of under 10 seconds. At the source:

  • Buffer flush retry_count: less than 5
  • Buffer size: less than 40 MB (600 MB provisioned)
  • Queue size: less than 2
  • CPU: less than 70%

Observations:

  • The JQ plugin tends to utilize 30% of a single CPU core at peak load.
  • Overall CPU across all cores is less than 70%.

Configuration delta from default:

        <buffer>
          @type memory
          chunk_limit_records 100000
          chunk_limit_size 200m
          flush_interval 5s
          flush_thread_count 5
          overflow_action throw_exception
          retry_max_times 3
          total_limit_size 600m
        </buffer>

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 14 (4 by maintainers)

Top GitHub Comments

2 reactions
matthewmodestino commented, Oct 3, 2019

The most common cause of missing logs I see in environments is log rotation settings that are too small for high-velocity logging. Please ensure Docker or logrotate is set to a sane production/high-volume size: the default of 10 MB is not enough, and something in the 1 GB to 10 GB range is probably more reasonable.
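
For illustration, a minimal sketch of raising Docker's json-file rotation limits in /etc/docker/daemon.json (the sizes here are placeholders, not from the original thread):

        {
          "log-driver": "json-file",
          "log-opts": {
            "max-size": "1g",
            "max-file": "5"
          }
        }

On clusters where the kubelet rotates container logs instead of Docker, the equivalent knobs are containerLogMaxSize and containerLogMaxFiles in the kubelet configuration.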

The other common cause is filling the buffer, though you would see that in the pod logs. Our customers who have stress tested it generally opt to use a fluentd file buffer to ensure they don’t lose data if the network is down or Splunk is overloaded.

This is just a sample; exact configs will vary.

        <buffer>
          @type file
          path /var/log/splunk-fluentd/
          chunk_limit_size 8MB
          total_limit_size 1GB
          flush_interval 5s
          flush_thread_count 1
          overflow_action block
          retry_forever true
        </buffer>

https://docs.fluentd.org/buffer/file
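
To confirm whether the buffer is actually filling, one option (a minimal illustrative config using the conventional port, not part of the original reply) is fluentd's built-in monitor_agent input, which exposes per-plugin buffer and retry metrics over HTTP:

        <source>
          @type monitor_agent
          bind 0.0.0.0
          port 24220
        </source>

Then curl http://localhost:24220/api/plugins.json and watch buffer_total_queued_size and retry_count on the Splunk HEC output plugin.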

Let me know if these get you sorted. If not, open a support ticket and we can gather info about your environment and assist!

1 reaction
bhattchaitanya commented, Oct 8, 2019

We did see significant improvement by increasing the log rotation file size limits in the Kubernetes cluster.spec file. Thanks for the hint about the enable_watch_time flag! We will play around with that. Can you elaborate on the “inputs that account for rolled logs”? What does this actually mean?
