Log loss observed at 15 MB/s @ 3000 events/second
Team,
I notice that at around 5000 bytes per message and 15 MB/second, the fluentd Splunk Connect forwarder fails to forward nearly two-thirds of the generated logs.
We don’t see any error logs generated from the daemonset. We don’t see any 5xx responses from the Splunk HEC endpoint, and the Splunk HEC load balancer does not indicate any latency problem. The TP99 latency of logs that successfully make it to Splunk is less than 10 seconds.
- Buffer flush retry_count at the source is less than 5
- Buffer size at the source is less than 40 MB (600 MB provisioned)
- Queue size is less than 2
- CPU is less than 70%
Observations:
- The JQ plugin tends to utilize 30% of a single CPU core at peak load.
- Overall CPU across all cores is less than 70%.
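One way to read the buffer metrics referenced above (retry_count, queue length, total buffered size) is fluentd's built-in monitor_agent input. This is a generic sketch using the documented default port, not the configuration from this deployment:
<source>
  @type monitor_agent
  bind 0.0.0.0
  port 24220
</source>
Querying http://localhost:24220/api/plugins.json then reports buffer_queue_length, buffer_total_queued_size and retry_count for each buffered output plugin.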
Configuration delta from default:
<buffer>
  @type memory
  chunk_limit_records 100000
  chunk_limit_size 200m
  flush_interval 5s
  flush_thread_count 5
  overflow_action throw_exception
  retry_max_times 3
  total_limit_size 600m
</buffer>
The two most common causes of missing logs I see in environments are, first, log rotation settings that are too small for high-velocity logging. Please ensure Docker or logrotate is set to a sane production/high-volume size; the 10MB default is not enough. Something more like 1GB to 10GB is probably more reasonable.
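As a rough illustration (not taken from this thread): on nodes that use Docker's json-file log driver, the rotation size can be raised in /etc/docker/daemon.json. The values below are assumptions to be tuned for your volume, not settings recommended by the maintainers:
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "1g",
    "max-file": "5"
  }
}
On clusters that run containerd instead of Docker, the rough equivalents are the kubelet's containerLogMaxSize and containerLogMaxFiles settings.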
The other cause could be filling the buffer, though you would see that in the pod logs. Our customers who have stress-tested it generally opt to use a fluentd file buffer to ensure they don't lose data if the network is down or Splunk is overloaded.
This is just a sample; exact configs will vary:
https://docs.fluentd.org/buffer/file
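A minimal sketch of such a file buffer, following the linked docs; the path and the size/flush values below are illustrative assumptions, not the sample from the original thread:
<buffer>
  # persist chunks to disk so they survive pod restarts and Splunk/network outages
  @type file
  path /var/log/fluentd-buffers/splunk.buffer
  chunk_limit_size 16m
  total_limit_size 2g
  flush_interval 5s
  flush_thread_count 5
  # block inputs instead of raising, and keep retrying rather than discarding chunks
  overflow_action block
  retry_forever true
</buffer>
This trades some throughput for durability: overflow_action block applies back-pressure to the inputs, and retry_forever keeps chunks on disk until Splunk accepts them instead of dropping them after retry_max_times.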
Let me know if these get you sorted. If not, open a support ticket and we can gather info about your environment and assist!
We did see significant improvement after increasing our log rotation file size limits in the Kubernetes cluster.spec file. Thanks for the hint about the enable_watch_timer flag! We will play around with that. Can you elaborate on “inputs that account for rolled logs”? What does this actually mean?