Scheduler re-deploy: running tasks failing with SIGTERM on K8s executor
Apache Airflow version: 2.1.0
Kubernetes version (if you are using kubernetes) (use kubectl version): v1.18.17-gke.1901
Environment:
- Cloud provider or hardware configuration: Google Cloud
- OS (e.g. from /etc/os-release): Debian GNU/Linux 10 (buster)
- Kernel (e.g. uname -a): Linux airflow-scheduler-7697b66974-m6mct 5.4.89+ #1 SMP Sat Feb 13 19:45:14 PST 2021 x86_64 GNU/Linux
- Install tools:
- Others:
What happened:
When the scheduler is restarted, the currently running tasks fail with a SIGTERM error.
Every time the scheduler is restarted or re-deployed, the current scheduler is terminated and a new scheduler is created. If tasks are running during this process, the new scheduler terminates these tasks with complete status and new task instances are created to continue the work of the terminated ones. After a few seconds the new task instances are terminated with error status and a SIGTERM error.
Error log
[2021-07-07 14:59:49,024] {cursor.py:661} INFO - query execution done
[2021-07-07 14:59:49,025] {arrow_result.pyx:0} INFO - fetching data done
[2021-07-07 15:00:07,361] {local_task_job.py:196} WARNING - State of this instance has been externally set to failed. Terminating instance.
[2021-07-07 15:00:07,363] {process_utils.py:100} INFO - Sending Signals.SIGTERM to GPID 150
[2021-07-07 15:00:12,845] {taskinstance.py:1264} ERROR - Received SIGTERM. Terminating subprocesses.
[2021-07-07 15:00:12,907] {process_utils.py:66} INFO - Process psutil.Process(pid=150, status='terminated', exitcode=0, started='14:59:46') (150) terminated with exit code 0
What you expected to happen:
The tasks currently running should be allowed to finish their work, or the substitute task instances should complete successfully. The new scheduler should not interfere with the running tasks.
How to reproduce it:
To reproduce, start a DAG that has a task (or tasks) that takes several minutes to complete. While the task is running, trigger a new deployment of the scheduler. During the re-deploy, the current scheduler will be terminated and a new one will be created. The running task will be marked as complete (without finishing its processing) and substituted by a new task instance that fails within seconds. A minimal reproduction DAG is sketched below.
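The original DAG is not included in the issue; a hypothetical sketch along these lines (placeholder dag_id and task_id, a plain PythonOperator that sleeps for ten minutes) is enough to exercise the scenario:

```python
# Hypothetical reproduction DAG: a single task that sleeps long enough
# for a scheduler re-deploy to happen while it is still running.
import time
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def long_running_task():
    # Keep the task busy for ~10 minutes so the scheduler can be
    # re-deployed in the meantime.
    time.sleep(600)


with DAG(
    dag_id="sigterm_repro",           # placeholder name
    start_date=datetime(2021, 7, 1),
    schedule_interval=None,           # trigger manually
    catchup=False,
) as dag:
    PythonOperator(
        task_id="sleep_ten_minutes",  # placeholder name
        python_callable=long_running_task,
    )
```

Trigger the DAG manually, then re-deploy the scheduler while the task is sleeping (for example with kubectl rollout restart on the scheduler Deployment, assuming the scheduler runs as a Deployment).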
Anything else we need to know:
The problem did not happen with Airflow 1.10.15; it started happening after the upgrade to Airflow 2.1.0.
Top GitHub Comments
OK, let me retract - our problem was, seemingly, different. The serviceaccount provided to the scheduler was not bound to roles containing the verb patch for the pod kind… so the issue is solved for us.
@rodrigo-morais maybe it was the same issue for you, not sure. Apologies in advance for the noise, but let me share in case it helps you. In our case, on the new scheduler’s startup its log contained a failed attempt to adopt pods.
We’ve edited the role resource (check with kubectl get role) by adding the patch verb to the pods kind, which was missing, as the error message said quite clearly indeed. A sketch of the resulting RBAC objects is below.
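For illustration only - a minimal sketch of the Role/RoleBinding shape this amounts to; the names and namespace are placeholders, not the actual manifests from our deployment:

```yaml
# Hypothetical Role/RoleBinding for the scheduler's service account.
# The key point is the "patch" verb on pods, which the scheduler needs
# in order to adopt running worker pods after a restart.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: airflow-pod-manager        # placeholder
  namespace: airflow               # placeholder
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["create", "get", "list", "watch", "delete", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: airflow-pod-manager        # placeholder
  namespace: airflow               # placeholder
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: airflow-pod-manager
subjects:
  - kind: ServiceAccount
    name: airflow-scheduler        # placeholder: the scheduler's serviceaccount
    namespace: airflow
```

The existing role and its verbs can be inspected with kubectl get role and kubectl describe role before editing.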
Screenshots:
- Pods before deploying a new scheduler
- Pods after deploying the scheduler
- Pods after the current scheduler has been terminated and the new one has been created
- Pod with error
- Pod with scheduler
- Log of the task
- The UI
Neither the previous logs nor these show any error in the DB.