HttpLogShipper starts hammering target after a certain amount of errors

If sending log messages fails for prolonged periods of time (e.g. due to a wrong configuration or the target system being unreachable), the log shipper will start hammering the target every 2 seconds (in the default configuration).

This is due to an overflow in ExponentialBackoffConnectionSchedule#NextInterval, where the cast to long in line 61 overflows. In the default configuration with a period of 2 seconds, backoffPeriod is 50000000 ticks (the minimum backoff period of 5 seconds, at 100 ns per tick). When the number of errors reaches 39, backoffFactor becomes 2^38. The expression var backedOff = (long)(backoffPeriod * backoffFactor) then evaluates (long)(2^38 * 50000000). Since 50000000 > 2^25, the product is greater than 2^38 * 2^25 = 2^63, which exceeds the largest long value, 2^63 - 1. The cast to long overflows and gives -9223372036854775808. Following through, because this value is negative, line 67 results in the actual backoff being the base period (2 seconds in the default configuration). From that point on, this happens every time the NextInterval getter is called.
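The arithmetic above can be checked with a small model. The following is a Python sketch, not the library's C# source: the function and variable names, the early return for the first failure, and the way the overflowing cast is emulated (out-of-range values mapped to the minimum 64-bit value, as reported in the issue) are all assumptions made for illustration.

```python
# Python model of the backoff arithmetic described in the issue.
# Not the library's C# code; names and structure are assumptions.

TICKS_PER_SECOND = 10_000_000            # one .NET tick is 100 ns
LONG_MIN, LONG_MAX = -2**63, 2**63 - 1   # range of a 64-bit signed integer


def cast_to_long(value: float) -> int:
    """Emulate the overflow reported in the issue: out of range -> LONG_MIN."""
    if value > LONG_MAX or value < LONG_MIN:
        return LONG_MIN
    return int(value)


def next_interval_seconds(failures: int, period_s: int = 2,
                          min_backoff_s: int = 5,
                          max_backoff_s: int = 600) -> float:
    period = period_s * TICKS_PER_SECOND
    if failures <= 1:
        # Assumption: the first failure simply waits for the base period.
        return period / TICKS_PER_SECOND
    backoff_factor = 2.0 ** (failures - 1)
    backoff_period = max(period, min_backoff_s * TICKS_PER_SECOND)
    # The cast the issue refers to as "line 61".
    backed_off = cast_to_long(backoff_period * backoff_factor)
    capped = min(max_backoff_s * TICKS_PER_SECOND, backed_off)
    # The comparison the issue refers to as "line 67".
    actual = max(period, capped)
    return actual / TICKS_PER_SECOND


for failures in (37, 38, 39, 40):
    print(failures, next_interval_seconds(failures))
# prints:
# 37 600.0
# 38 600.0
# 39 2.0
# 40 2.0
```

Under these assumptions the interval stays at the 10 minute cap up to the 38th failure and drops to the 2 second base period from the 39th failure on, which matches the behavior described in the report.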

To Reproduce

Configure the HTTP sink to log to an invalid target (e.g. a computer without a server running).
Emit a log message and wait for 39 retries (roughly 6 hours; this can be shortened by lowering MaximumBackoffInterval, e.g. to a few seconds).
See HttpLogShipper retrying every 2 seconds (in the default configuration).
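To get a feeling for how long the reproduction takes, the waits between retries can be summed up. The sketch below is a rough Python estimate, not a measurement: it assumes the intended schedule (doubling from the 5 second minimum backoff, capped at the maximum), assumes that a failed request itself takes no time, and uses hypothetical function and parameter names.

```python
# Rough estimate of the waiting time before the 39th failure.
# Assumes each failed attempt takes no time; only the waits are summed.

def interval_seconds(failures: int, period_s: int = 2,
                     min_backoff_s: int = 5, max_backoff_s: int = 600) -> int:
    """Intended schedule (no overflow), in whole seconds."""
    if failures <= 1:
        return period_s
    backed_off = max(period_s, min_backoff_s) * 2 ** (failures - 1)
    return max(period_s, min(max_backoff_s, backed_off))


def seconds_until_failure(n: int, **config: int) -> int:
    """Sum of the waits that follow failures 1 .. n-1."""
    return sum(interval_seconds(k, **config) for k in range(1, n))


default_wait = seconds_until_failure(39)
short_wait = seconds_until_failure(39, max_backoff_s=5)
print(f"10 min cap: {default_wait} s (~{default_wait / 3600:.1f} h)")
print(f"5 s cap:    {short_wait} s")
```

In this model the default configuration spends a little over 5 hours waiting, which is of the same order as the "roughly 6 hours" above once the time for each failed request is added. With a maximum backoff of 5 seconds the same point is reached after about 3 minutes.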

Expected behavior

The retry interval should stay capped at 10 minutes, even if the messages cannot be sent for prolonged periods of time. One way to achieve this is to cap failuresSinceSuccessfulConnection at a reasonably small value, so that the exponential function does not produce excessively large values.
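The suggested mitigation can be sketched as follows. This is a Python model under stated assumptions, not the fix that was actually applied to the library: MAX_COUNTED_FAILURES is a hypothetical constant, and the function mirrors the schedule only as far as the issue describes it.

```python
# Sketch of the suggested mitigation: cap the failure count before
# exponentiating, so the product always fits into a 64-bit integer.

TICKS_PER_SECOND = 10_000_000   # one .NET tick is 100 ns
LONG_MAX = 2**63 - 1
MAX_COUNTED_FAILURES = 20       # hypothetical cap, not a library setting


def next_interval_seconds(failures: int, period_s: int = 2,
                          min_backoff_s: int = 5,
                          max_backoff_s: int = 600) -> float:
    period = period_s * TICKS_PER_SECOND
    if failures <= 1:
        return period / TICKS_PER_SECOND
    # Failures beyond the cap no longer increase the exponent.
    counted = min(failures, MAX_COUNTED_FAILURES)
    backoff_factor = 2.0 ** (counted - 1)
    backoff_period = max(period, min_backoff_s * TICKS_PER_SECOND)
    product = backoff_period * backoff_factor
    # With the defaults the product is at most 5e7 * 2^19, far below 2^63.
    assert product <= LONG_MAX
    backed_off = int(product)
    capped = min(max_backoff_s * TICKS_PER_SECOND, backed_off)
    return max(period, capped) / TICKS_PER_SECOND


print(next_interval_seconds(39), next_interval_seconds(10_000))
```

With a cap of 20 counted failures the factor never exceeds 2^19, so the cast stays in range for base periods up to roughly 20 days, and the interval remains at the 10 minute maximum no matter how long the target is unreachable. Comparing the floating-point product against the maximum interval before casting would be an alternative with the same effect.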

Issue Analytics

Created 4 years ago
Comments: 6 (4 by maintainers)

Top GitHub Comments

ghost commented, Feb 13, 2020

Thanks for the prompt reaction. We're good. We just told our clients to get their sh*t together 😉 The issue starts filling up the log files after 6 hours of running with a wrong configuration, so most of the time people notice well before then that they are not getting any remote log messages. This one actually popped up on a test system that nobody paid any real attention to, so nothing critical.

FantasticFiasco commented, Feb 21, 2020

Thanks for reporting it. Best of luck to you in the future!