
Scheduler drifts in cycles


Dagster version

1.1.2

What’s the issue?

We are currently running stress tests on Dagster to find out its limits when running on k8s. One of the tests was designed to measure how precise the Dagster scheduler is. We started with 40 concurrently scheduled jobs (40 sleep jobs run every minute) and then used the event_log and run_tags tables to evaluate the actual start-up intervals. It turned out that even though each job should fire at 00 seconds, as expected from the cron definition, the fire event is actually delayed by a few seconds. This delay drifts over time in a range of 1-6 seconds from the expected scheduled time, increasing in roughly 18-minute cycles and then dropping back to 1 second.

We distilled a minimal example with one sleep op, one job and one every-minute schedule and ran it on an empty cluster, but the result is the same. There was no difference between the setup with and without the async thread pool, as you can see below (the first chart is sync, the second async). The blue line in both diagrams shows the difference between the scheduled time and the real time of job submission (calculated as PIPELINE_STARTING - dagster/scheduled_execution_time).

[Chart: latency during test — one job every minute, sync (no threadpool)]

[Chart: latency during test — one job every minute, async threadpool]
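
For reference, here is a minimal sketch of that latency calculation using the public DagsterInstance API. It assumes access to the same Postgres-backed instance (for example, from inside the daemon pod) and that scheduled runs carry the standard dagster/scheduled_execution_time tag as an ISO-8601 string; it is illustrative rather than the exact query used against event_log and run_tags.

from datetime import datetime

from dagster import DagsterEventType, DagsterInstance

# Compare the scheduled execution time (run tag) with the timestamp of the
# PIPELINE_STARTING event for each scheduled run.
instance = DagsterInstance.get()

for run in instance.get_runs(limit=50):
    scheduled_tag = run.tags.get("dagster/scheduled_execution_time")
    if not scheduled_tag:
        continue  # not launched by a schedule
    scheduled_time = datetime.fromisoformat(scheduled_tag)

    starting_events = instance.all_logs(
        run.run_id, of_type=DagsterEventType.PIPELINE_STARTING
    )
    if not starting_events:
        continue
    started_at = datetime.fromtimestamp(
        starting_events[0].timestamp, tz=scheduled_time.tzinfo
    )
    drift = (started_at - scheduled_time).total_seconds()
    print(f"{run.run_id}  scheduled={scheduled_tag}  drift={drift:+.1f}s")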

The drift is visible even in the daemon log. Note the line at 10:11:03.

dagster 2022-12-05 10:10:13 +0000 - dagster.daemon.SchedulerDaemon - INFO - No new tick times to evaluate for sleep_job_schedule
dagster 2022-12-05 10:10:43 +0000 - dagster.daemon.SensorDaemon - INFO - Not checking for any runs since no sensors have been started.
dagster 2022-12-05 10:11:03 +0000 - dagster.daemon.SchedulerDaemon - INFO - Evaluating schedule `sleep_job_schedule` at 2022-12-05 10:11:00 +0000
dagster 2022-12-05 10:11:04 +0000 - dagster.daemon.SchedulerDaemon - INFO - Completed scheduled launch of run 26be178c-4160-472c-81b2-e35db7b4a988 for sleep_job_schedule
dagster 2022-12-05 10:11:18 +0000 - dagster.daemon.SchedulerDaemon - INFO - Checking for new runs for the following schedules: sleep_job_schedule  

What did you expect to happen?

We expect jobs to be executed at the scheduled time.

How to reproduce?

  1. Deploy Dagster using the official Helm chart with a PostgreSQL database.
  2. Create a custom repository with the following job and schedule:
import time

from dagster import job, op, repository, ScheduleDefinition, in_process_executor

@op
def sleep_op():
    # Simulate one second of work per run.
    time.sleep(1)

@job
def sleep_job():
    sleep_op()

# Fire sleep_job at the top of every minute.
sleep_schedule = ScheduleDefinition(job=sleep_job, cron_schedule="*/1 * * * *")

@repository(default_executor_def=in_process_executor)
def repo():
    return [sleep_schedule]

  3. Deploy the user code deployment and turn on the every-minute schedule.
  4. Monitor the drift in the dagster-daemon log (see the sketch below).
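
A rough way to follow the drift for step 4 is to parse the scheduler lines of the daemon log directly. The sketch below assumes the log format matches the excerpt above and reads lines from stdin, for example kubectl logs <daemon-pod> | python drift_from_daemon_log.py (the script name is just illustrative).

import re
import sys
from datetime import datetime

# Matches lines such as:
# dagster 2022-12-05 10:11:03 +0000 - dagster.daemon.SchedulerDaemon - INFO -
#   Evaluating schedule `sleep_job_schedule` at 2022-12-05 10:11:00 +0000
LINE_RE = re.compile(
    r"^dagster (?P<logged>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} [+-]\d{4}).*"
    r"Evaluating schedule `.+` at (?P<scheduled>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} [+-]\d{4})"
)
FMT = "%Y-%m-%d %H:%M:%S %z"

for line in sys.stdin:
    match = LINE_RE.search(line)
    if not match:
        continue
    logged = datetime.strptime(match.group("logged"), FMT)
    scheduled = datetime.strptime(match.group("scheduled"), FMT)
    drift = (logged - scheduled).total_seconds()
    print(f"{scheduled.isoformat()}  evaluated {drift:+.0f}s after the scheduled tick")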

Deployment type

Dagster Helm chart

Deployment details

We are using the official Helm chart, but we have separated the user code deployment into the dedicated chart (https://artifacthub.io/packages/helm/dagster/dagster-user-deployments). Jobs are submitted via the K8sRunLauncher without any tweaks.

Additional information

No response

Message from the maintainers

Impacted by this issue? Give it a 👍! We factor engagement into prioritization.

Issue Analytics

  • State: open
  • Created: 10 months ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

1 reaction
gibsondan commented, Dec 5, 2022

I put out a change here that I believe will help with this specific benchmark at least, certainly in the case where there is a single schedule running: https://github.com/dagster-io/dagster/pull/10886/files

However, as the number of schedules in your Dagster instance increases, there would likely be some effect on latency as well. We have some levers that you can pull in the Helm chart to help with latency, like adding a threadpool that executes the schedules in parallel.

0 reactions
gibsondan commented, Dec 5, 2022

Thanks for the context - that makes sense. I think the attached PR should make the specific cyclic behavior that you observed more consistent once it lands.


