Airflow Scheduler continuously restarts when using the CeleryKubernetesExecutor
See original GitHub issueApache Airflow version
2.3.2 (latest released)
What happened
The scheduler crashes with the following exception. Once the scheduler crashes restarts will cause it to immediately crash again. To get scheduler back working. All dags must be paused and all tasks that are running need to have it’s state changed to up for retry. This is something we just started noticing after switching to the CeleryKubernetesExecutor.
[2022-06-16 20:12:04,535] {scheduler_job.py:1350} WARNING - Failing (3) jobs without heartbeat after 2022-06-16 20:07:04.512590+00:00
[2022-06-16 20:12:04,535] {scheduler_job.py:1358} ERROR - Detected zombie job: {'full_filepath': '/airflow-efs/dags/Scanner.py', 'msg': 'Detected <TaskInstance: lmnop-domain-scanner.Macadocious manual__2022-06-16T02:27:36.281445+00:00 [running]> as zombie', 'simple_task_instance': <airflow.models.taskinstance.SimpleTaskInstance object at 0x7f96de2fc890>, 'is_failure_callback': True}
[2022-06-16 20:12:04,537] {scheduler_job.py:756} ERROR - Exception when executing SchedulerJob._run_scheduler_loop
Traceback (most recent call last):
File "/pyroot/lib/python3.7/site-packages/airflow/jobs/scheduler_job.py", line 739, in _execute
self._run_scheduler_loop()
File "/pyroot/lib/python3.7/site-packages/airflow/jobs/scheduler_job.py", line 839, in _run_scheduler_loop
next_event = timers.run(blocking=False)
File "/usr/local/lib/python3.7/sched.py", line 151, in run
action(*argument, **kwargs)
File "/pyroot/lib/python3.7/site-packages/airflow/utils/event_scheduler.py", line 36, in repeat
action(*args, **kwargs)
File "/pyroot/lib/python3.7/site-packages/airflow/utils/session.py", line 71, in wrapper
return func(*args, session=session, **kwargs)
File "/pyroot/lib/python3.7/site-packages/airflow/jobs/scheduler_job.py", line 1359, in _find_zombies
self.executor.send_callback(request)
File "/pyroot/lib/python3.7/site-packages/airflow/executors/celery_kubernetes_executor.py", line 218, in send_callback
self.callback_sink.send(request)
File "/pyroot/lib/python3.7/site-packages/airflow/utils/session.py", line 71, in wrapper
return func(*args, session=session, **kwargs)
File "/pyroot/lib/python3.7/site-packages/airflow/callbacks/database_callback_sink.py", line 34, in send
db_callback = DbCallbackRequest(callback=callback, priority_weight=10)
File "<string>", line 4, in __init__
File "/pyroot/lib/python3.7/site-packages/sqlalchemy/orm/state.py", line 437, in _initialize_instance
manager.dispatch.init_failure(self, args, kwargs)
File "/pyroot/lib/python3.7/site-packages/sqlalchemy/util/langhelpers.py", line 72, in __exit__
with_traceback=exc_tb,
File "/pyroot/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 211, in raise_
raise exception
File "/pyroot/lib/python3.7/site-packages/sqlalchemy/orm/state.py", line 434, in _initialize_instance
return manager.original_init(*mixed[1:], **kwargs)
File "/pyroot/lib/python3.7/site-packages/airflow/models/db_callback_request.py", line 44, in __init__
self.callback_data = callback.to_json()
File "/pyroot/lib/python3.7/site-packages/airflow/callbacks/callback_requests.py", line 79, in to_json
return json.dumps(dict_obj)
File "/usr/local/lib/python3.7/json/__init__.py", line 231, in dumps
return _default_encoder.encode(obj)
File "/usr/local/lib/python3.7/json/encoder.py", line 199, in encode
chunks = self.iterencode(o, _one_shot=True)
File "/usr/local/lib/python3.7/json/encoder.py", line 257, in iterencode
return _iterencode(o, 0)
File "/usr/local/lib/python3.7/json/encoder.py", line 179, in default
raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type datetime is not JSON serializable
[2022-06-16 20:12:04,573] {kubernetes_executor.py:813} INFO - Shutting down Kubernetes executor
[2022-06-16 20:12:04,574] {kubernetes_executor.py:773} WARNING - Executor shutting down, will NOT run task=(TaskInstanceKey(dag_id='lmnop-processor', task_id='launch-xyz-pod', run_id='manual__2022-06-16T19:53:04.707461+00:00', try_number=1, map_index=-1), ['airflow', 'tasks', 'run', 'lmnop-processor', 'launch-xyz-pod', 'manual__2022-06-16T19:53:04.707461+00:00', '--local', '--subdir', 'DAGS_FOLDER/lmnop.py'], None, None)
[2022-06-16 20:12:04,574] {kubernetes_executor.py:773} WARNING - Executor shutting down, will NOT run task=(TaskInstanceKey(dag_id='lmnop-processor', task_id='launch-xyz-pod', run_id='manual__2022-06-16T19:53:04.831929+00:00', try_number=1, map_index=-1), ['airflow', 'tasks', 'run', 'lmnop-processor', 'launch-xyz-pod', 'manual__2022-06-16T19:53:04.831929+00:00', '--local', '--subdir', 'DAGS_FOLDER/lmnop.py'], None, None)
[2022-06-16 20:12:04,601] {scheduler_job.py:768} INFO - Exited execute loop
Traceback (most recent call last):
File "/pyroot/bin/airflow", line 8, in <module>
sys.exit(main())
File "/pyroot/lib/python3.7/site-packages/airflow/__main__.py", line 38, in main
args.func(args)
File "/pyroot/lib/python3.7/site-packages/airflow/cli/cli_parser.py", line 51, in command
return func(*args, **kwargs)
File "/pyroot/lib/python3.7/site-packages/airflow/utils/cli.py", line 99, in wrapper
return f(*args, **kwargs)
File "/pyroot/lib/python3.7/site-packages/airflow/cli/commands/scheduler_command.py", line 75, in scheduler
_run_scheduler_job(args=args)
File "/pyroot/lib/python3.7/site-packages/airflow/cli/commands/scheduler_command.py", line 46, in _run_scheduler_job
job.run()
File "/pyroot/lib/python3.7/site-packages/airflow/jobs/base_job.py", line 244, in run
self._execute()
File "/pyroot/lib/python3.7/site-packages/airflow/jobs/scheduler_job.py", line 739, in _execute
self._run_scheduler_loop()
File "/pyroot/lib/python3.7/site-packages/airflow/jobs/scheduler_job.py", line 839, in _run_scheduler_loop
next_event = timers.run(blocking=False)
File "/usr/local/lib/python3.7/sched.py", line 151, in run
action(*argument, **kwargs)
File "/pyroot/lib/python3.7/site-packages/airflow/utils/event_scheduler.py", line 36, in repeat
action(*args, **kwargs)
File "/pyroot/lib/python3.7/site-packages/airflow/utils/session.py", line 71, in wrapper
return func(*args, session=session, **kwargs)
File "/pyroot/lib/python3.7/site-packages/airflow/jobs/scheduler_job.py", line 1359, in _find_zombies
self.executor.send_callback(request)
File "/pyroot/lib/python3.7/site-packages/airflow/executors/celery_kubernetes_executor.py", line 218, in send_callback
self.callback_sink.send(request)
File "/pyroot/lib/python3.7/site-packages/airflow/utils/session.py", line 71, in wrapper
return func(*args, session=session, **kwargs)
File "/pyroot/lib/python3.7/site-packages/airflow/callbacks/database_callback_sink.py", line 34, in send
db_callback = DbCallbackRequest(callback=callback, priority_weight=10)
File "<string>", line 4, in __init__
File "/pyroot/lib/python3.7/site-packages/sqlalchemy/orm/state.py", line 437, in _initialize_instance
manager.dispatch.init_failure(self, args, kwargs)
File "/pyroot/lib/python3.7/site-packages/sqlalchemy/util/langhelpers.py", line 72, in __exit__
with_traceback=exc_tb,
File "/pyroot/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 211, in raise_
raise exception
File "/pyroot/lib/python3.7/site-packages/sqlalchemy/orm/state.py", line 434, in _initialize_instance
return manager.original_init(*mixed[1:], **kwargs)
File "/pyroot/lib/python3.7/site-packages/airflow/models/db_callback_request.py", line 44, in __init__
self.callback_data = callback.to_json()
File "/pyroot/lib/python3.7/site-packages/airflow/callbacks/callback_requests.py", line 79, in to_json
return json.dumps(dict_obj)
File "/usr/local/lib/python3.7/json/__init__.py", line 231, in dumps
return _default_encoder.encode(obj)
File "/usr/local/lib/python3.7/json/encoder.py", line 199, in encode
chunks = self.iterencode(o, _one_shot=True)
File "/usr/local/lib/python3.7/json/encoder.py", line 257, in iterencode
return _iterencode(o, 0)
File "/usr/local/lib/python3.7/json/encoder.py", line 179, in default
raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type datetime is not JSON serializable
What you think should happen instead
The error itself seems like a minor issue and should not happen and easy to fix. But what seems like a bigger issue is how the scheduler was not able to recover on it’s own and was stuck in an endless restart loop.
How to reproduce
I’m not sure of the most simple step by step way to reproduce. But the conditions of my airflow workflow was about 4 active dags chugging through with about 50 max active runs and 50 concurrent each, with one dag set with 150 max active runs and 50 concurrent. ( not really that much )
The dag with the 150 max active runs is running the kubernetesExecutor create a pod in the local kubernetes environment. this I think is the reason we’re seeing this issue all of a sudden.
Hopefully this helps in potentially reproducing it.
Operating System
Debian GNU/Linux 10 (buster)
Versions of Apache Airflow Providers
apache-airflow-providers-amazon==3.4.0 apache-airflow-providers-celery==2.1.4 apache-airflow-providers-cncf-kubernetes==4.0.2 apache-airflow-providers-ftp==2.1.2 apache-airflow-providers-http==2.1.2 apache-airflow-providers-imap==2.2.3 apache-airflow-providers-postgres==4.1.0 apache-airflow-providers-redis==2.0.4 apache-airflow-providers-sqlite==2.1.3
Deployment
Other Docker-based deployment
Deployment details
we create our own airflow base images using the instructions provided on your site, here is a snippet of the code we use to install
RUN pip3 install "apache-airflow[statsd,aws,kubernetes,celery,redis,postgres,sentry]==${AIRFLOW_VERSION}" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-$AIRFLOW_VERSION/constraints-$PYTHON_VERSION.txt"
We then use this docker image for all of our airflow workers, scheduler, dagprocessor and airflow web This is managed through a custom helm script. Also we have incorporated the use of pgbouncer to manage db connections similar to the publicly available helm charts
Anything else
The problem seems to occur quite frequently. It makes the system completely unusable.
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project’s Code of Conduct
Issue Analytics
- State:
- Created a year ago
- Comments:12 (7 by maintainers)

Top Related StackOverflow Question
Could you please make a PR + test with it so that we could discuss it there @meetri ?
Not sure how to or if it’s worth making a pull request for this, but here is the change I made to get airflow scheduler back working.