HttpLogShipper starts hammering target after a certain amount of errors
If sending log messages fails for prolonged periods of time (e.g. due to a wrong configuration or the target system being unreachable), the log shipper will start hammering the target every 2 seconds (in the default configuration).
This is due to an overflow in ExponentialBackoffConnectionSchedule#NextInterval, where the cast in line 61 overflows. In the default configuration with a period of 2 seconds, backoffPeriod will be 50000000 ticks (as per the minimum backoff period of 5 seconds). When the number of errors reaches 39, the backoffFactor becomes 2^38. The expression var backedOff = (long)(backoffPeriod * backoffFactor) becomes (long)(2^38 * 50000000), whose argument is greater than 2^38 * 2^25 (50000000 being > 2^25), i.e. greater than 2^63, which no longer fits into a long. The cast to long overflows and gives -9223372036854775808. Following through, line 67 results in the actual backoff being the base period (2 seconds in the default configuration), because the negative value is smaller than the base period. From that point on, this happens every time NextInterval.get is called.
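The threshold of 39 errors can be checked with a few lines of C#. This is a minimal sketch of the arithmetic described above, not the library's source; the class and method names are made up for illustration. It deliberately compares the product as a double instead of performing the unchecked cast, because the result of casting an out-of-range double to long is not something to rely on across runtimes (the value reported in this issue is long.MinValue).

```csharp
using System;

// Hypothetical demo class; mirrors the arithmetic from the issue text only.
public static class BackoffOverflowDemo
{
    // "Ideal" backoff in ticks, kept as a double so nothing overflows here.
    public static double IdealBackoffTicks(int failures)
    {
        double backoffFactor = Math.Pow(2, failures - 1);   // 2^38 at 39 failures
        long backoffPeriod = TimeSpan.FromSeconds(5).Ticks; // 50,000,000 ticks
        return backoffPeriod * backoffFactor;
    }

    // A long can hold values up to 2^63 - 1, so anything >= 2^63 does not fit.
    public static bool FitsInLong(double ticks) => ticks < Math.Pow(2, 63);

    public static void Main()
    {
        Console.WriteLine(FitsInLong(IdealBackoffTicks(38))); // True
        Console.WriteLine(FitsInLong(IdealBackoffTicks(39))); // False
    }
}
```

With 38 failures the product is about 6.9e18 and still fits; with 39 it is about 1.4e19, above 2^63 (about 9.2e18).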
To Reproduce
- Configure the HTTP sink to log to an invalid target (e.g. a computer without a server running).
- Emit a log message and wait for 39 retries (roughly 6 hours; this could be shortened by changing the MaximumBackoffInterval to something shorter, e.g. a few seconds).
- See HttpLogShipper retrying every 2 seconds (in the default configuration).
Expected behavior
The retransmission timeout should stay capped at 10 minutes, even if the messages cannot be sent for prolonged periods of time (e.g. by capping failuresSinceSuccessfulConnection to something reasonably small, so that the exponential function does not produce excessively large values).
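One way the suggested cap could look is sketched below. This is an illustration under assumptions, not the fix that was actually merged: the class name, the MaxCountedFailures constant and its value of 20, and the shortcut for the first failure are all hypothetical. Only the identifiers quoted in this issue (backoffPeriod, backoffFactor, failuresSinceSuccessfulConnection, MaximumBackoffInterval) and the 5 second and 10 minute limits come from the report.

```csharp
using System;

// Hypothetical sketch of a backoff schedule that cannot overflow.
public static class CappedBackoffSketch
{
    private static readonly TimeSpan MinimumBackoffPeriod = TimeSpan.FromSeconds(5);
    private static readonly TimeSpan MaximumBackoffInterval = TimeSpan.FromMinutes(10);

    // Assumed cap: 2^19 * 5 s is already far above 10 minutes, so no backoff is lost.
    private const int MaxCountedFailures = 20;

    public static TimeSpan NextInterval(TimeSpan period, int failuresSinceSuccessfulConnection)
    {
        // Assumption: the first failure simply retries after the base period.
        if (failuresSinceSuccessfulConnection <= 1) return period;

        // Cap the exponent so the factor stays small.
        int failures = Math.Min(failuresSinceSuccessfulConnection, MaxCountedFailures);
        double backoffFactor = Math.Pow(2, failures - 1);
        long backoffPeriod = Math.Max(period.Ticks, MinimumBackoffPeriod.Ticks);

        // Clamp while still a double, so the cast to long only sees an in-range value.
        double ideal = backoffPeriod * backoffFactor;
        long capped = (long)Math.Min(ideal, (double)MaximumBackoffInterval.Ticks);

        // Never go below the base period.
        return TimeSpan.FromTicks(Math.Max(period.Ticks, capped));
    }

    public static void Main()
    {
        TimeSpan period = TimeSpan.FromSeconds(2);
        Console.WriteLine(NextInterval(period, 2));      // 00:00:10
        Console.WriteLine(NextInterval(period, 39));     // 00:10:00
        Console.WriteLine(NextInterval(period, 100000)); // 00:10:00
    }
}
```

Either measure alone would avoid the overflow for the default configuration; clamping before the cast additionally covers very large base periods.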
Issue Analytics
- Created 4 years ago
- Comments: 6 (4 by maintainers)
Top GitHub Comments
Thanks for the prompt reaction. We’re good. We just told our clients to get their sh*t together 😉 The issue starts filling up the log files after 6 hours of having a wrong configuration, so most of the time people notice well before then that they are not getting any remote log messages. This one actually popped up on a test system which nobody paid any real attention to, so nothing critical.
Thanks for reporting it. Best of luck to you in the future!