Airflow Scheduler continuously restarts when using the CeleryKubernetesExecutor


Apache Airflow version

2.3.2 (latest released)

What happened

The scheduler crashes with the following exception. Once it crashes, restarting it causes it to immediately crash again. To get the scheduler working again, all DAGs must be paused and every running task's state must be changed to up_for_retry. This is something we only started noticing after switching to the CeleryKubernetesExecutor.

[2022-06-16 20:12:04,535] {scheduler_job.py:1350} WARNING - Failing (3) jobs without heartbeat after 2022-06-16 20:07:04.512590+00:00
[2022-06-16 20:12:04,535] {scheduler_job.py:1358} ERROR - Detected zombie job: {'full_filepath': '/airflow-efs/dags/Scanner.py', 'msg': 'Detected <TaskInstance: lmnop-domain-scanner.Macadocious manual__2022-06-16T02:27:36.281445+00:00 [running]> as zombie', 'simple_task_instance': <airflow.models.taskinstance.SimpleTaskInstance object at 0x7f96de2fc890>, 'is_failure_callback': True}
[2022-06-16 20:12:04,537] {scheduler_job.py:756} ERROR - Exception when executing SchedulerJob._run_scheduler_loop
Traceback (most recent call last):
  File "/pyroot/lib/python3.7/site-packages/airflow/jobs/scheduler_job.py", line 739, in _execute
    self._run_scheduler_loop()
  File "/pyroot/lib/python3.7/site-packages/airflow/jobs/scheduler_job.py", line 839, in _run_scheduler_loop
    next_event = timers.run(blocking=False)
  File "/usr/local/lib/python3.7/sched.py", line 151, in run
    action(*argument, **kwargs)
  File "/pyroot/lib/python3.7/site-packages/airflow/utils/event_scheduler.py", line 36, in repeat
    action(*args, **kwargs)
  File "/pyroot/lib/python3.7/site-packages/airflow/utils/session.py", line 71, in wrapper
    return func(*args, session=session, **kwargs)
  File "/pyroot/lib/python3.7/site-packages/airflow/jobs/scheduler_job.py", line 1359, in _find_zombies
    self.executor.send_callback(request)
  File "/pyroot/lib/python3.7/site-packages/airflow/executors/celery_kubernetes_executor.py", line 218, in send_callback
    self.callback_sink.send(request)
  File "/pyroot/lib/python3.7/site-packages/airflow/utils/session.py", line 71, in wrapper
    return func(*args, session=session, **kwargs)
  File "/pyroot/lib/python3.7/site-packages/airflow/callbacks/database_callback_sink.py", line 34, in send
    db_callback = DbCallbackRequest(callback=callback, priority_weight=10)
  File "<string>", line 4, in __init__
  File "/pyroot/lib/python3.7/site-packages/sqlalchemy/orm/state.py", line 437, in _initialize_instance
    manager.dispatch.init_failure(self, args, kwargs)
  File "/pyroot/lib/python3.7/site-packages/sqlalchemy/util/langhelpers.py", line 72, in __exit__
    with_traceback=exc_tb,
  File "/pyroot/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 211, in raise_
    raise exception
  File "/pyroot/lib/python3.7/site-packages/sqlalchemy/orm/state.py", line 434, in _initialize_instance
    return manager.original_init(*mixed[1:], **kwargs)
  File "/pyroot/lib/python3.7/site-packages/airflow/models/db_callback_request.py", line 44, in __init__
    self.callback_data = callback.to_json()
  File "/pyroot/lib/python3.7/site-packages/airflow/callbacks/callback_requests.py", line 79, in to_json
    return json.dumps(dict_obj)
  File "/usr/local/lib/python3.7/json/__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
  File "/usr/local/lib/python3.7/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/local/lib/python3.7/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "/usr/local/lib/python3.7/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type datetime is not JSON serializable
[2022-06-16 20:12:04,573] {kubernetes_executor.py:813} INFO - Shutting down Kubernetes executor
[2022-06-16 20:12:04,574] {kubernetes_executor.py:773} WARNING - Executor shutting down, will NOT run task=(TaskInstanceKey(dag_id='lmnop-processor', task_id='launch-xyz-pod', run_id='manual__2022-06-16T19:53:04.707461+00:00', try_number=1, map_index=-1), ['airflow', 'tasks', 'run', 'lmnop-processor', 'launch-xyz-pod', 'manual__2022-06-16T19:53:04.707461+00:00', '--local', '--subdir', 'DAGS_FOLDER/lmnop.py'], None, None)
[2022-06-16 20:12:04,574] {kubernetes_executor.py:773} WARNING - Executor shutting down, will NOT run task=(TaskInstanceKey(dag_id='lmnop-processor', task_id='launch-xyz-pod', run_id='manual__2022-06-16T19:53:04.831929+00:00', try_number=1, map_index=-1), ['airflow', 'tasks', 'run', 'lmnop-processor', 'launch-xyz-pod', 'manual__2022-06-16T19:53:04.831929+00:00', '--local', '--subdir', 'DAGS_FOLDER/lmnop.py'], None, None)
[2022-06-16 20:12:04,601] {scheduler_job.py:768} INFO - Exited execute loop
Traceback (most recent call last):
  File "/pyroot/bin/airflow", line 8, in <module>
    sys.exit(main())
  File "/pyroot/lib/python3.7/site-packages/airflow/__main__.py", line 38, in main
    args.func(args)
  File "/pyroot/lib/python3.7/site-packages/airflow/cli/cli_parser.py", line 51, in command
    return func(*args, **kwargs)
  File "/pyroot/lib/python3.7/site-packages/airflow/utils/cli.py", line 99, in wrapper
    return f(*args, **kwargs)
  File "/pyroot/lib/python3.7/site-packages/airflow/cli/commands/scheduler_command.py", line 75, in scheduler
    _run_scheduler_job(args=args)
  File "/pyroot/lib/python3.7/site-packages/airflow/cli/commands/scheduler_command.py", line 46, in _run_scheduler_job
    job.run()
  File "/pyroot/lib/python3.7/site-packages/airflow/jobs/base_job.py", line 244, in run
    self._execute()
  File "/pyroot/lib/python3.7/site-packages/airflow/jobs/scheduler_job.py", line 739, in _execute
    self._run_scheduler_loop()
  File "/pyroot/lib/python3.7/site-packages/airflow/jobs/scheduler_job.py", line 839, in _run_scheduler_loop
    next_event = timers.run(blocking=False)
  File "/usr/local/lib/python3.7/sched.py", line 151, in run
    action(*argument, **kwargs)
  File "/pyroot/lib/python3.7/site-packages/airflow/utils/event_scheduler.py", line 36, in repeat
    action(*args, **kwargs)
  File "/pyroot/lib/python3.7/site-packages/airflow/utils/session.py", line 71, in wrapper
    return func(*args, session=session, **kwargs)
  File "/pyroot/lib/python3.7/site-packages/airflow/jobs/scheduler_job.py", line 1359, in _find_zombies
    self.executor.send_callback(request)
  File "/pyroot/lib/python3.7/site-packages/airflow/executors/celery_kubernetes_executor.py", line 218, in send_callback
    self.callback_sink.send(request)
  File "/pyroot/lib/python3.7/site-packages/airflow/utils/session.py", line 71, in wrapper
    return func(*args, session=session, **kwargs)
  File "/pyroot/lib/python3.7/site-packages/airflow/callbacks/database_callback_sink.py", line 34, in send
    db_callback = DbCallbackRequest(callback=callback, priority_weight=10)
  File "<string>", line 4, in __init__
  File "/pyroot/lib/python3.7/site-packages/sqlalchemy/orm/state.py", line 437, in _initialize_instance
    manager.dispatch.init_failure(self, args, kwargs)
  File "/pyroot/lib/python3.7/site-packages/sqlalchemy/util/langhelpers.py", line 72, in __exit__
    with_traceback=exc_tb,
  File "/pyroot/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 211, in raise_
    raise exception
  File "/pyroot/lib/python3.7/site-packages/sqlalchemy/orm/state.py", line 434, in _initialize_instance
    return manager.original_init(*mixed[1:], **kwargs)
  File "/pyroot/lib/python3.7/site-packages/airflow/models/db_callback_request.py", line 44, in __init__
    self.callback_data = callback.to_json()
  File "/pyroot/lib/python3.7/site-packages/airflow/callbacks/callback_requests.py", line 79, in to_json
    return json.dumps(dict_obj)
  File "/usr/local/lib/python3.7/json/__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
  File "/usr/local/lib/python3.7/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/local/lib/python3.7/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "/usr/local/lib/python3.7/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type datetime is not JSON serializable

What you think should happen instead

The error itself seems like a minor issue that should not happen and should be easy to fix. The bigger issue is that the scheduler could not recover on its own and was stuck in an endless restart loop.
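For context, the failure is easy to reproduce outside Airflow: the standard library JSON encoder simply has no handler for datetime objects. A minimal sketch of the underlying behavior (plain Python, not Airflow code; the field names are illustrative):

import json
from datetime import datetime, timezone

# Mimics what TaskCallbackRequest.to_json() does: the SimpleTaskInstance
# __dict__ carries datetime fields, which json.dumps cannot encode by default.
payload = {"dag_id": "example", "start_date": datetime.now(timezone.utc)}

try:
    json.dumps(payload)
except TypeError as exc:
    print(exc)  # Object of type datetime is not JSON serializable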

How to reproduce

I’m not sure of the simplest step-by-step way to reproduce this. The conditions in my Airflow environment were roughly four active DAGs churning along, each with about 50 max active runs and 50 concurrent tasks, plus one DAG with 150 max active runs and 50 concurrent tasks (not really that much).

The DAG with the 150 max active runs uses the KubernetesExecutor to create a pod in the local Kubernetes environment; I think this is the reason we’re suddenly seeing this issue.

Hopefully this helps in potentially reproducing it.
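To make the shape of that workload concrete, here is a hypothetical sketch (the DAG id, command, and numbers are illustrative, not the reporter's actual DAGs) of a DAG whose task is routed to the KubernetesExecutor side of the CeleryKubernetesExecutor via the default "kubernetes" queue:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="lmnop-processor-repro",  # hypothetical name for illustration
    start_date=datetime(2022, 6, 1),
    schedule_interval=None,
    max_active_runs=150,
    max_active_tasks=50,
) as dag:
    BashOperator(
        task_id="launch-xyz-pod",
        bash_command="sleep 60",  # stand-in for the real pod-launching work
        queue="kubernetes",  # routes this task to the KubernetesExecutor
    )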

Operating System

Debian GNU/Linux 10 (buster)

Versions of Apache Airflow Providers

apache-airflow-providers-amazon==3.4.0
apache-airflow-providers-celery==2.1.4
apache-airflow-providers-cncf-kubernetes==4.0.2
apache-airflow-providers-ftp==2.1.2
apache-airflow-providers-http==2.1.2
apache-airflow-providers-imap==2.2.3
apache-airflow-providers-postgres==4.1.0
apache-airflow-providers-redis==2.0.4
apache-airflow-providers-sqlite==2.1.3

Deployment

Other Docker-based deployment

Deployment details

We create our own Airflow base images using the instructions provided on your site. Here is a snippet of the code we use to install:

RUN pip3 install "apache-airflow[statsd,aws,kubernetes,celery,redis,postgres,sentry]==${AIRFLOW_VERSION}" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-$AIRFLOW_VERSION/constraints-$PYTHON_VERSION.txt"

We then use this Docker image for all of our Airflow workers, the scheduler, the DAG processor, and the Airflow webserver. This is managed through a custom Helm chart. We have also incorporated pgbouncer to manage DB connections, similar to the publicly available Helm charts.

Anything else

The problem seems to occur quite frequently. It makes the system completely unusable.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project’s Code of Conduct

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 12 (7 by maintainers)

Top GitHub Comments

potiuk commented, Jun 19, 2022 (2 reactions)

Could you please make a PR + test with it so that we could discuss it there @meetri ?

meetri commented, Jun 18, 2022 (2 reactions)

Not sure how to make a pull request for this, or whether it’s worth it, but here is the change I made to get the Airflow scheduler working again.

diff --git a/airflow/callbacks/callback_requests.py b/airflow/callbacks/callback_requests.py
index 8112589cd..ecf0ae1c2 100644
--- a/airflow/callbacks/callback_requests.py
+++ b/airflow/callbacks/callback_requests.py
@@ -16,6 +16,7 @@
 # under the License.

 import json
+from datetime import date, datetime
 from typing import TYPE_CHECKING, Optional

 if TYPE_CHECKING:
@@ -76,7 +77,13 @@ class TaskCallbackRequest(CallbackRequest):
     def to_json(self) -> str:
         dict_obj = self.__dict__.copy()
         dict_obj["simple_task_instance"] = dict_obj["simple_task_instance"].__dict__
-        return json.dumps(dict_obj)
+
+        def datetime_serializer(obj):
+            if isinstance(obj, (datetime, date)):
+                return obj.isoformat()
+            raise TypeError(f"Type {type(obj)} not serializable")
+
+        return json.dumps(dict_obj, default=datetime_serializer)

     @classmethod
     def from_json(cls, json_str: str):
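One caveat worth flagging about this patch: to_json() would now emit ISO-8601 strings, but from_json() would need the inverse conversion for a lossless round trip. A minimal standalone sketch of the serializer's behavior (plain Python, not Airflow code; the payload is illustrative):

import json
from datetime import date, datetime, timezone

def datetime_serializer(obj):
    # Same idea as the patch above: encode datetimes as ISO-8601 strings.
    if isinstance(obj, (datetime, date)):
        return obj.isoformat()
    raise TypeError(f"Type {type(obj)} not serializable")

payload = {"dag_id": "example", "start_date": datetime.now(timezone.utc)}
encoded = json.dumps(payload, default=datetime_serializer)
decoded = json.loads(encoded)

# Deserialization yields a plain string; restoring the datetime is a
# separate step that from_json() would have to perform explicitly.
restored = datetime.fromisoformat(decoded["start_date"])
print(encoded, restored, sep="\n")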