Scheduler drifts in cycles
Dagster version
1.1.2
What’s the issue?
We are currently running stress tests on Dagster to find its limits when running on Kubernetes. One of the tests was designed to measure how precise the Dagster scheduler is. We started with 40 concurrently scheduled jobs (run 40 sleep jobs every minute) and then queried the event_log and run_tags tables to evaluate the start-up intervals. It turned out that even though each job should fire at 00 seconds, as the cron definition specifies, the fire event is actually delayed by a few seconds. This delay drifts over time in a range of 1–6 seconds from the expected schedule time, increasing in roughly 18-minute cycles and then dropping back to 1 second.
We distilled a minimal example with one sleep op, one job, and one every-minute schedule and ran it on an empty cluster, but the result is the same. There was no difference between the setup with and without the async thread pool, as you can see below (the first diagram is sync, the second async). The blue line in both diagrams shows the difference between the scheduled time and the real time of job submission (calculated as PIPELINE_STARTING - .dagster/scheduled_execution_time).
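For reference, the per-run drift can be computed as follows. This is a minimal sketch, assuming you have already fetched the PIPELINE_STARTING event timestamp from event_log (Unix epoch seconds) and the `.dagster/scheduled_execution_time` run tag (ISO-8601 string); the function name is ours, not part of Dagster:

```python
from datetime import datetime, timezone

def submission_drift(pipeline_starting_ts: float, scheduled_execution_time: str) -> float:
    """Drift in seconds between the scheduled tick and the actual
    PIPELINE_STARTING event for one run."""
    scheduled = datetime.fromisoformat(scheduled_execution_time)
    started = datetime.fromtimestamp(pipeline_starting_ts, tz=timezone.utc)
    return (started - scheduled).total_seconds()

# Example: a run scheduled for 10:11:00 whose PIPELINE_STARTING event
# landed at 10:11:03 has a drift of 3 seconds.
print(submission_drift(
    datetime(2022, 12, 5, 10, 11, 3, tzinfo=timezone.utc).timestamp(),
    "2022-12-05T10:11:00+00:00",
))  # → 3.0
```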
The drift is notable even from the daemon log. Check the line at time 10:11:03.
dagster 2022-12-05 10:10:13 +0000 - dagster.daemon.SchedulerDaemon - INFO - No new tick times to evaluate for sleep_job_schedule
dagster 2022-12-05 10:10:43 +0000 - dagster.daemon.SensorDaemon - INFO - Not checking for any runs since no sensors have been started.
dagster 2022-12-05 10:11:03 +0000 - dagster.daemon.SchedulerDaemon - INFO - Evaluating schedule `sleep_job_schedule` at 2022-12-05 10:11:00 +0000
dagster 2022-12-05 10:11:04 +0000 - dagster.daemon.SchedulerDaemon - INFO - Completed scheduled launch of run 26be178c-4160-472c-81b2-e35db7b4a988 for sleep_job_schedule
dagster 2022-12-05 10:11:18 +0000 - dagster.daemon.SchedulerDaemon - INFO - Checking for new runs for the following schedules: sleep_job_schedule
What did you expect to happen?
We expect jobs to be launched at the scheduled time.
How to reproduce?
- Deploy dagster using the official Helm chart with PostgreSQL database.
- Create a custom repository with the following op, job, and schedule:

```python
import time

from dagster import ScheduleDefinition, in_process_executor, job, op, repository


@op
def sleep_op():
    # Simulate a trivial workload
    time.sleep(1)


@job
def sleep_job():
    sleep_op()


# Fire every minute, at second 00
sleep_schedule = ScheduleDefinition(job=sleep_job, cron_schedule="*/1 * * * *")


@repository(default_executor_def=in_process_executor)
def repo():
    return [sleep_schedule]
```
- Deploy the user code deployment and turn on the every-minute schedule.
- Monitor the drift in the dagster-daemon log.
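To monitor the drift from the daemon log automatically, a small parser like the one below works against the log format shown above. This is our own sketch (the regex and function names are not part of Dagster) and assumes the `Evaluating schedule ... at ...` line format stays stable:

```python
import re
from datetime import datetime

# Matches daemon lines like:
# dagster 2022-12-05 10:11:03 +0000 - dagster.daemon.SchedulerDaemon - INFO -
#   Evaluating schedule `sleep_job_schedule` at 2022-12-05 10:11:00 +0000
LINE_RE = re.compile(
    r"dagster (?P<wall>[\d-]+ [\d:]+ \+0000) - dagster\.daemon\.SchedulerDaemon"
    r" - INFO - Evaluating schedule `(?P<name>\w+)` at (?P<tick>[\d-]+ [\d:]+ \+0000)"
)

def drifts(log_text: str):
    """Yield (schedule_name, drift_seconds) for each evaluation line."""
    fmt = "%Y-%m-%d %H:%M:%S %z"
    for m in LINE_RE.finditer(log_text):
        wall = datetime.strptime(m["wall"], fmt)   # when the daemon evaluated
        tick = datetime.strptime(m["tick"], fmt)   # the tick it was evaluating
        yield m["name"], (wall - tick).total_seconds()

log = (
    "dagster 2022-12-05 10:11:03 +0000 - dagster.daemon.SchedulerDaemon - INFO - "
    "Evaluating schedule `sleep_job_schedule` at 2022-12-05 10:11:00 +0000\n"
)
print(list(drifts(log)))  # → [('sleep_job_schedule', 3.0)]
```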
Deployment type
Dagster Helm chart
Deployment details
We are using the official Helm chart, but we have separated the user code deployment into the dedicated chart (https://artifacthub.io/packages/helm/dagster/dagster-user-deployments). Jobs are submitted via K8sRunLauncher without any tweaks.
Additional information
No response
Message from the maintainers
Impacted by this issue? Give it a 👍! We factor engagement into prioritization.
Issue Analytics
- State:
- Created 10 months ago
- Comments: 6 (4 by maintainers)
Top GitHub Comments
I put out a change here that I believe will help with this specific benchmark at least, certainly in the case where there is a single schedule running: https://github.com/dagster-io/dagster/pull/10886/files
However, as the number of schedules in your Dagster instance increases, there would likely be some effect on latency as well. We have some levers that you can pull in the Helm chart to help with latency, like adding a threadpool that executes the schedules in parallel.
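For anyone else hitting this: the thread pool mentioned above is instance-level configuration. As a sketch, assuming the `schedules` stanza of dagster.yaml (check the Dagster instance-config docs for your version; the worker count here is an arbitrary example), it looks like:

```yaml
# dagster.yaml (instance config) — evaluate schedules in parallel threads
schedules:
  use_threads: true
  num_workers: 8   # example value; tune to your schedule count
```

When deploying with the Helm chart, the same stanza can be supplied through the chart's instance-config overrides rather than by editing dagster.yaml directly.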
Thanks for the context - that makes sense. I think the attached PR should make the specific cyclic behavior that you observed more consistent once it lands.