Scheduler drifts in cycles
Dagster version
1.1.2
What’s the issue?
We are currently running stress tests on Dagster to find its limits when running on Kubernetes. One of the tests was designed to measure how precise the Dagster scheduler is. We started with 40 concurrently scheduled jobs (run 40 sleep jobs every minute) and then queried the event_log and run_tags tables to evaluate the start-up intervals. It turned out that even though each job should fire at 00 seconds, as the cron definition specifies, the fire event is actually delayed by a few seconds. This delay drifts over time in a range of 1–6 seconds from the expected schedule time, increasing in roughly 18-minute cycles and then dropping back to 1 second.
We distilled a minimal example with one sleep op, one job, and one every-minute schedule and ran it on an empty cluster, but the result is the same. There was no difference between the setup with and without the async thread pool, as you can see below (the first diagram is sync, the second async). The blue line in both diagrams shows the difference between the scheduled time and the real time of job submission (calculated as PIPELINE_STARTING - .dagster/scheduled_execution_time).
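For reference, the per-run drift can be computed as follows. This is a minimal sketch, assuming you have already fetched the PIPELINE_STARTING event timestamp from event_log (Unix epoch seconds) and the `.dagster/scheduled_execution_time` run tag (ISO-8601 string); the function name is ours, not part of Dagster:

```python
from datetime import datetime, timezone

def submission_drift(pipeline_starting_ts: float, scheduled_execution_time: str) -> float:
    """Drift in seconds between the scheduled tick and the actual
    PIPELINE_STARTING event for one run."""
    scheduled = datetime.fromisoformat(scheduled_execution_time)
    started = datetime.fromtimestamp(pipeline_starting_ts, tz=timezone.utc)
    return (started - scheduled).total_seconds()

# Example: a run scheduled for 10:11:00 whose PIPELINE_STARTING event
# landed at 10:11:03 has a drift of 3 seconds.
print(submission_drift(
    datetime(2022, 12, 5, 10, 11, 3, tzinfo=timezone.utc).timestamp(),
    "2022-12-05T10:11:00+00:00",
))  # → 3.0
```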
The drift is notable even from the daemon log. Check the line at time 10:11:03.
dagster 2022-12-05 10:10:13 +0000 - dagster.daemon.SchedulerDaemon - INFO - No new tick times to evaluate for sleep_job_schedule
dagster 2022-12-05 10:10:43 +0000 - dagster.daemon.SensorDaemon - INFO - Not checking for any runs since no sensors have been started.
dagster 2022-12-05 10:11:03 +0000 - dagster.daemon.SchedulerDaemon - INFO - Evaluating schedule `sleep_job_schedule` at 2022-12-05 10:11:00 +0000
dagster 2022-12-05 10:11:04 +0000 - dagster.daemon.SchedulerDaemon - INFO - Completed scheduled launch of run 26be178c-4160-472c-81b2-e35db7b4a988 for sleep_job_schedule
dagster 2022-12-05 10:11:18 +0000 - dagster.daemon.SchedulerDaemon - INFO - Checking for new runs for the following schedules: sleep_job_schedule
What did you expect to happen?
We expect jobs to be launched at the scheduled time.
How to reproduce?
- Deploy dagster using the official Helm chart with PostgreSQL database.
- Create a custom repository with the following op, job, and schedule:

```python
import time

from dagster import ScheduleDefinition, in_process_executor, job, op, repository


@op
def sleep_op():
    # Simulate a trivial workload
    time.sleep(1)


@job
def sleep_job():
    sleep_op()


# Fire every minute, at second 00
sleep_schedule = ScheduleDefinition(job=sleep_job, cron_schedule="*/1 * * * *")


@repository(default_executor_def=in_process_executor)
def repo():
    return [sleep_schedule]
```
- Deploy the user code deployment and turn on the every-minute schedule.
- Monitor the drift in the dagster-daemon log.
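To monitor the drift from the daemon log automatically, a small parser like the one below works against the log format shown above. This is our own sketch (the regex and function names are not part of Dagster) and assumes the `Evaluating schedule ... at ...` line format stays stable:

```python
import re
from datetime import datetime

# Matches daemon lines like:
# dagster 2022-12-05 10:11:03 +0000 - dagster.daemon.SchedulerDaemon - INFO -
#   Evaluating schedule `sleep_job_schedule` at 2022-12-05 10:11:00 +0000
LINE_RE = re.compile(
    r"dagster (?P<wall>[\d-]+ [\d:]+ \+0000) - dagster\.daemon\.SchedulerDaemon"
    r" - INFO - Evaluating schedule `(?P<name>\w+)` at (?P<tick>[\d-]+ [\d:]+ \+0000)"
)

def drifts(log_text: str):
    """Yield (schedule_name, drift_seconds) for each evaluation line."""
    fmt = "%Y-%m-%d %H:%M:%S %z"
    for m in LINE_RE.finditer(log_text):
        wall = datetime.strptime(m["wall"], fmt)   # when the daemon evaluated
        tick = datetime.strptime(m["tick"], fmt)   # the tick it was evaluating
        yield m["name"], (wall - tick).total_seconds()

log = (
    "dagster 2022-12-05 10:11:03 +0000 - dagster.daemon.SchedulerDaemon - INFO - "
    "Evaluating schedule `sleep_job_schedule` at 2022-12-05 10:11:00 +0000\n"
)
print(list(drifts(log)))  # → [('sleep_job_schedule', 3.0)]
```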
Deployment type
Dagster Helm chart
Deployment details
We are using the official Helm chart, but we have separated the user code deployment into the dedicated chart (https://artifacthub.io/packages/helm/dagster/dagster-user-deployments). Jobs are submitted via K8sRunLauncher without any tweaks.
Additional information
No response
Message from the maintainers
Impacted by this issue? Give it a 👍! We factor engagement into prioritization.
Issue Analytics
- State:
- Created 10 months ago
- Comments: 6 (4 by maintainers)
Top GitHub Comments
I put out a change here that I believe will help with this specific benchmark at least, certainly in the case where there is a single schedule running: https://github.com/dagster-io/dagster/pull/10886/files
However, as the number of schedules in your Dagster instance increases, there would likely be some effect on latency as well. We have some levers that you can pull in the Helm chart to help with latency, like adding a threadpool that executes the schedules in parallel.
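For anyone else hitting this: the thread pool mentioned above is instance-level configuration. As a sketch, assuming the `schedules` stanza of dagster.yaml (check the Dagster instance-config docs for your version; the worker count here is an arbitrary example), it looks like:

```yaml
# dagster.yaml (instance config) — evaluate schedules in parallel threads
schedules:
  use_threads: true
  num_workers: 8   # example value; tune to your schedule count
```

When deploying with the Helm chart, the same stanza can be supplied through the chart's instance-config overrides rather than by editing dagster.yaml directly.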
Thanks for the context - that makes sense. I think the attached PR should make the specific cyclic behavior that you observed more consistent once it lands.