Feature request: allow interval jobs to be offset by a random value

When jobs are configured to execute at certain times, they tend to coalesce at one particular time. For example, scheduling 10 jobs at the top of every hour, another 10 jobs once at noon and once at midnight, and one more job that executes every minute will cause all of these jobs to run at the same time at noon and midnight, potentially exhausting the available threads.
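To make the pile-up concrete, here is a small sketch that counts how many of the jobs from the example above are due at a given wall-clock minute. The function name jobs_due is made up for illustration and has nothing to do with APScheduler itself.

```python
def jobs_due(hour, minute):
    """Count the jobs from the example that fire at hour:minute."""
    due = 1  # the job that runs every minute
    if minute == 0:
        due += 10  # the 10 jobs that run at the top of every hour
        if hour in (0, 12):
            due += 10  # the 10 jobs that run at noon and at midnight
    return due


noon = jobs_due(12, 0)
one_pm = jobs_due(13, 0)
ordinary_minute = jobs_due(12, 1)
```

At noon and midnight all 21 jobs are due in the same instant, while an ordinary minute only has one.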

This could be solved by introducing a new parameter to the interval trigger that would define a value in seconds by which each job would be offset.

Assuming these new imports:

from operator import add, sub
from random import randrange, choice

The constructor could have a new parameter value (random_offset) that would be at most equivalent to the configured interval:

self.random_offset = min(self.interval_length, random_offset)

Then, the get_next_fire_time() function could check if an offset is defined and modify the next execution time accordingly:

if self.random_offset:
    offset = timedelta(seconds=randrange(0, self.random_offset))
    op = choice((add, sub))
    next_fire_time = max(datetime.now(self.timezone), op(next_fire_time, offset))
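For reference, the two snippets above can be put together into a standalone function that runs without APScheduler. This is only a sketch of the idea: apply_random_offset and its parameters are names I made up, the current time and the random generator are passed in so the behaviour can be tested, and the offset is assumed to be a whole number of seconds.

```python
from datetime import datetime, timedelta, timezone
from operator import add, sub
from random import Random


def apply_random_offset(next_fire_time, interval_length, random_offset, now, rng):
    """Shift next_fire_time by a random number of seconds in either direction."""
    # Cap the offset at the interval length, as in the proposed constructor.
    capped = min(interval_length, random_offset)
    if not capped:
        return next_fire_time
    offset = timedelta(seconds=rng.randrange(0, capped))
    # Randomly fire either earlier or later than the nominal time.
    op = rng.choice((add, sub))
    # Never return a fire time that is already in the past.
    return max(now, op(next_fire_time, offset))


now = datetime(2017, 12, 7, 12, 0, 0, tzinfo=timezone.utc)
scheduled = now + timedelta(seconds=300)
# A 600 second offset on a 300 second interval is capped at 300 seconds.
shifted = apply_random_offset(scheduled, 300, 600, now, Random(42))
```

Because randrange() excludes its upper bound, the shifted time always stays strictly less than one interval away from the nominal one.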

I understand all this can be worked around by setting custom start times for each group of jobs so that they don’t intersect, but having an easy-to-configure parameter would be cool too. The UNIX Anacron scheduler uses a similar mechanism with its RANDOM_DELAY option.

I did notice during testing of the above code that get_next_fire_time() is called both from _get_run_times() (in job.py) and from _process_jobs() (in base.py), and because the offset is random, each call produces a different time. I’m not sure whether it matters that these two callers no longer receive the same next fire time.
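One possible way to keep repeated calls consistent, sketched here purely as an assumption and not as how APScheduler works, is to remember the last draw and reuse it when the trigger is asked again about the same previous fire time. JitteredInterval is a hypothetical stand-in class with a simplified signature, and for brevity the offset is only ever added.

```python
from datetime import datetime, timedelta, timezone
from random import Random


class JitteredInterval:
    """Hypothetical stand-in for an interval trigger with a random offset."""

    def __init__(self, interval_seconds, random_offset, seed=None):
        self.interval = timedelta(seconds=interval_seconds)
        self.random_offset = min(interval_seconds, random_offset)
        self._rng = Random(seed)
        self._last = None  # (previous_fire_time, jittered result)

    def get_next_fire_time(self, previous_fire_time):
        # Reuse the earlier draw when asked again about the same previous time.
        if self._last is not None and self._last[0] == previous_fire_time:
            return self._last[1]
        offset = timedelta(seconds=self._rng.randrange(0, self.random_offset + 1))
        result = previous_fire_time + self.interval + offset
        self._last = (previous_fire_time, result)
        return result


trigger = JitteredInterval(interval_seconds=600, random_offset=60, seed=1)
previous = datetime(2017, 12, 7, 12, 0, tzinfo=timezone.utc)
first = trigger.get_next_fire_time(previous)
second = trigger.get_next_fire_time(previous)  # same answer as the first call
```

Without the remembered value, the second call would draw a fresh offset and the two callers would disagree.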

Anyway, is this something you’d consider merging if I were to submit a PR? Thank you!

Issue Analytics

State:
Created 7 years ago
Comments: 10 (8 by maintainers)

Top GitHub Comments

2 reactions

gilbsgilbs commented, Dec 7, 2017

Can you elaborate on this?

Let’s say I have 5 nodes and I deployed APScheduler on all of them. I use a memory jobstore and threaded executors. I have idempotent jobs so that I can execute them whenever I want on any of my nodes. I run a job every 10 minutes using an IntervalTrigger or a CronJob. This job calls an external API + performs some CPU intensive tasks.

I don’t want the loss of one node to compromise the health of the cluster.
On average, I don’t want my nodes to perform the API call at the exact same moment. This would cause a DoS.
If there’s a lock to take, for example, I don’t want the same node to take it over and over again and do all the work every time. I want the work to be roughly evenly balanced across nodes. (e.g. If one node is unable to do the job for any reason, I want to be informed ASAP + in the case of CPU-intensive operations, it’s generally a good idea to balance load => CPU credit)
I want to minimize the number of times I execute the same job on many nodes in the same time interval. It’s generally pointless.

The “lean” way to achieve this (I think) is to add a random offset to the job execution (as suggested by this issue) and randomize whether the job executes or not. It’s cheap yet good-enough HA and LB without involving a master election, a distributed cron system, a service discovery or such complex and expensive things.
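A minimal sketch of that per-node decision, assuming each node makes it independently with no coordination. The function plan_run and the run_probability parameter are made-up names for illustration and are not part of APScheduler.

```python
import random


def plan_run(interval_seconds, run_probability, rng=random):
    """Return the delay in seconds before this node runs the job, or None to skip this round."""
    # Randomize whether this node executes the job at all in this interval.
    if rng.random() >= run_probability:
        return None
    # Randomize when within the interval the job starts.
    return rng.uniform(0, interval_seconds)


rng = random.Random(7)
# Five nodes, a 10 minute interval, each node running with probability 0.4.
plans = [plan_run(600, 0.4, rng) for _ in range(5)]
```

With these numbers two nodes run per round on average, but nothing guarantees that at least one does: the chance that all five skip a round is 0.6 ** 5, or roughly 8%. That is the price of avoiding coordination, which is why this only suits idempotent jobs that tolerate an occasional missed or duplicated run.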

But in the general case: you know how many sysadmins like to run their cron jobs at fixed, non-round times to “respect resources”. Having a random offset is just a clean way to address this issue, and it’s probably what most people here want to see.

The only reason these back-ends were added is because I received well made PRs for both of them.

Given how cool APScheduler is, no doubt it attracts the best contributors 😉. Thanks for your time.

0 reactions

mprpic commented, Dec 7, 2017

Also, with the right kind of wrong settings you may end up calculating a next run time that is the same or earlier than the previous run time. This must never happen, so take that into consideration.

The implementation that I noted in the initial comment ensures this by capping the offset at the minimum of the interval between two runs and the configured random delay (jitter). So if you have jobs configured to run every 5 minutes but configure the random delay to be 10 minutes, the maximum offset will still only be 5 minutes, so as not to overrun into the next run or collide with a previously scheduled one (if that makes sense).
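Spelled out as a sketch, with made-up helper names and an extra explicit guard that was not in my original snippet:

```python
from datetime import datetime, timedelta, timezone


def effective_offset(interval_seconds, random_delay_seconds):
    """Cap the configured random delay at the interval length."""
    return min(interval_seconds, random_delay_seconds)


def clamp_fire_time(candidate, previous_fire_time, now):
    """Keep a jittered time from landing in the past or at/before the previous run."""
    earliest = now
    if previous_fire_time is not None:
        # The next run must be strictly later than the previous one.
        earliest = max(now, previous_fire_time + timedelta(microseconds=1))
    return max(candidate, earliest)


# Jobs every 5 minutes with a 10 minute random delay: the cap is 5 minutes.
cap = effective_offset(5 * 60, 10 * 60)

previous = datetime(2017, 12, 7, 12, 0, tzinfo=timezone.utc)
now = previous + timedelta(seconds=10)
# A candidate that jitter pushed back before the previous run gets pulled forward.
too_early = previous - timedelta(seconds=30)
safe = clamp_fire_time(too_early, previous, now)
```

The cap alone already keeps the offset below one interval. The clamp is a belt-and-braces check for the "right kind of wrong settings" case mentioned above.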