Graceful termination does not work with the Apache Airflow Helm chart
In apache/airflow, the Helm chart has a worker default of terminationGracePeriodSeconds: 600.
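For reference, this is roughly what that default looks like in the values file (a sketch only; the exact key path may vary between chart versions):

```yaml
# Sketch of the relevant chart values, assuming a workers.* key layout.
workers:
  # How long Kubernetes waits after sending SIGTERM before it sends SIGKILL.
  terminationGracePeriodSeconds: 600
```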
After deploying with 1.10.14, I observed that the worker was terminated immediately. This reproduced consistently.
I also tested with 2.0.0 and again had no luck.
Does anyone have hints on what to look into?
Here are some logs from a worker that shutdown ungracefully, running 1.10.14:
worker: Warm shutdown (MainProcess)
[2021-01-09 22:24:30,747: ERROR/MainProcess] Process 'ForkPoolWorker-15' pid:37 exited with 'signal 15 (SIGTERM)'
[2021-01-09 22:24:30,858: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: signal 15 (SIGTERM).')
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.7/site-packages/celery/worker/worker.py", line 208, in start
self.blueprint.start(self)
File "/home/airflow/.local/lib/python3.7/site-packages/celery/bootsteps.py", line 119, in start
step.start(parent)
File "/home/airflow/.local/lib/python3.7/site-packages/celery/bootsteps.py", line 369, in start
return self.obj.start()
File "/home/airflow/.local/lib/python3.7/site-packages/celery/worker/consumer/consumer.py", line 318, in start
blueprint.start(self)
File "/home/airflow/.local/lib/python3.7/site-packages/celery/bootsteps.py", line 119, in start
step.start(parent)
File "/home/airflow/.local/lib/python3.7/site-packages/celery/worker/consumer/consumer.py", line 599, in start
c.loop(*c.loop_args())
File "/home/airflow/.local/lib/python3.7/site-packages/celery/worker/loops.py", line 83, in asynloop
next(loop)
File "/home/airflow/.local/lib/python3.7/site-packages/kombu/asynchronous/hub.py", line 308, in create_loop
events = poll(poll_timeout)
File "/home/airflow/.local/lib/python3.7/site-packages/kombu/utils/eventio.py", line 84, in poll
return self._epoll.poll(timeout if timeout is not None else -1)
File "/home/airflow/.local/lib/python3.7/site-packages/celery/apps/worker.py", line 285, in _handle_request
raise exc(exitcode)
celery.exceptions.WorkerShutdown: 0
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.7/site-packages/billiard/pool.py", line 1267, in mark_as_worker_lost
human_status(exitcode)),
billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 15 (SIGTERM).
[2021-01-09 22:24:30,865: ERROR/MainProcess] Process 'ForkPoolWorker-16' pid:38 exited with 'signal 15 (SIGTERM)'
-------------- celery@airflow-worker-66b7bf687b-8j2x5 v4.4.7 (cliffs)
--- ***** -----
-- ******* ---- Linux-4.14.209-160.335.amzn2.x86_64-x86_64-with-debian-10.6 2021-01-09 22:22:38
- *** --- * ---
- ** ---------- [config]
- ** ---------- .> app: airflow.executors.celery_executor:0x7fc31160fd90
- ** ---------- .> transport: redis://:**@airflow-redis:6379/0
- ** ---------- .> results: postgresql://postgres:**@airflow-pgbouncer:6543/airflow-result-backend
- *** --- * --- .> concurrency: 16 (prefork)
-- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
--- ***** -----
-------------- [queues]
.> celery exchange=celery(direct) key=celery
[tasks]
. airflow.executors.celery_executor.execute_command
And again with 2.0.0:
[2021-01-10 06:46:21,159: INFO/MainProcess] Connected to redis://:**@airflow-redis:6379/0
[2021-01-10 06:46:21,168: INFO/MainProcess] mingle: searching for neighbors
[2021-01-10 06:46:22,208: INFO/MainProcess] mingle: all alone
[2021-01-10 06:46:22,224: INFO/MainProcess] celery@airflow-worker-66b9b6495b-6m7jd ready.
[2021-01-10 06:46:25,199: INFO/MainProcess] Events of group {task} enabled by remote.
[2021-01-10 06:47:47,441: INFO/MainProcess] Received task: airflow.executors.celery_executor.execute_command[dab010cc-72fb-4f73-8c53-05b46ca71848]
[2021-01-10 06:47:47,503: INFO/ForkPoolWorker-7] Executing command in Celery: ['airflow', 'tasks', 'run', 'standish_test_dag', 'test-secrets-backend', '2021-01-10T06:44:35.573723+00:00', '--local', '--pool', 'default_pool', '--subdir', '/opt/airflow/dags/standish_test.py']
[2021-01-10 06:47:47,549: INFO/ForkPoolWorker-7] Filling up the DagBag from /opt/airflow/dags/standish_test.py
[2021-01-10 06:47:47,830: INFO/ForkPoolWorker-7] Loading 1 plugin(s) took 0.26 seconds
[2021-01-10 06:47:47,845: WARNING/ForkPoolWorker-7] Running <TaskInstance: standish_test_dag.test-secrets-backend 2021-01-10T06:44:35.573723+00:00 [queued]> on host 10.5.21.64
[2021-01-10 06:48:17,735] {_internal.py:113} INFO - 10.5.22.61 - - [10/Jan/2021 06:48:17] "GET /log/standish_test_dag/test-secrets-backend/2021-01-10T06:44:35.573723+00:00/1.log HTTP/1.1" 404 -
[2021-01-10 06:48:17,738] {_internal.py:113} INFO - 10.5.22.61 - - [10/Jan/2021 06:48:17] "GET /log/standish_test_dag/test-secrets-backend/2021-01-10T06:44:35.573723+00:00/2.log HTTP/1.1" 200 -
With 2.0.0 there's no error, but termination is still immediate, without respecting the grace period.
I tried various combinations of args and saw the same behavior every time (values sketch below):
["bash", "-c", "airflow worker"]
["bash", "-c", "exec airflow worker"]
["worker"]
Comments (10, 9 by maintainers):
Correct, no mods to the entrypoint. You can see which things I tried in the Helm config above: different values of args or command.
No, I do not want the worker to terminate immediately. I want it to do what it is supposed to do, namely a warm shutdown: stop taking new tasks and keep running until either all tasks are done or the grace period has elapsed.
I will take a look later this week. It also depends on which command is used to run the Airflow components. You are talking about the current master version of the chart, yeah? No modification to the entrypoint or command?
Dumb-init and tini are equivalent, and they are indeed there to forward signals to the running processes. This is really useful when you have a bash script as the entrypoint: if you have bash as the direct entrypoint, it will not forward signals to its children. There are two solutions (sketched below):
a) dumb-init or tini as the entrypoint
b) exec 'binary' at the end of the bash script (exec-ing another bash won't work)
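As a sketch of what that looks like at the container level (illustrative pod-spec fragments, not the chart's exact output; the Airflow 2 worker command is airflow celery worker, in 1.10 it was airflow worker):

```yaml
# (a) dumb-init (or tini) as PID 1, forwarding SIGTERM to the worker it spawns.
containers:
  - name: worker
    command: ["dumb-init", "--"]
    args: ["airflow", "celery", "worker"]
    # (b) alternatively, a bash wrapper that exec's the worker, so the worker
    #     process replaces bash and receives the forwarded SIGTERM directly:
    # command: ["bash", "-c", "exec airflow celery worker"]
```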
The default entrypoint in the prod image is dumb-init, so it should propagate the signals properly. But as @xinbinhuang mentioned, the celery worker has a number of config options: when you send it a SIGTERM, the celery worker will stop spawning new processes and wait for all the running tasks to terminate, so by definition the worker might take quite some time to exit. The termination grace period controls how long celery gets to wait for all processes to finish before everything left is 'kill -9'-ed and the worker exits non-gracefully.
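Putting that together, the signal path for a worker pod looks roughly like this (an illustrative fragment, not the exact rendered chart output):

```yaml
# Kubernetes sends SIGTERM to PID 1 (dumb-init), which forwards it to the celery
# worker; celery then does a warm shutdown and waits for running tasks. When
# terminationGracePeriodSeconds elapses, Kubernetes sends SIGKILL to whatever is
# left, which is the "kill -9", non-graceful exit described above.
spec:
  terminationGracePeriodSeconds: 600     # should cover your longest-running task
  containers:
    - name: worker
      image: apache/airflow:2.0.0
      args: ["celery", "worker"]         # illustrative; the real args come from the chart
```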
Also, there is another gotcha: if you send a SECOND SIGTERM to such a celery worker while it is waiting for tasks, it will terminate all the processes with 'kill -9' and exit immediately.
So if you see the worker terminate immediately, you might actually be observing wrong behaviour where someone sent more than one SIGTERM to those workers (I've seen such setups), but this is a rather bad idea IMHO.
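Just to illustrate how a worker can end up receiving two SIGTERMs, here is a hypothetical anti-pattern (the preStop hook below is made up for illustration, not something the chart ships):

```yaml
# Anti-pattern sketch: the preStop hook already SIGTERMs the worker, and once the
# hook finishes Kubernetes sends its own SIGTERM to PID 1, which dumb-init forwards.
# Celery sees two SIGTERMs and escalates the warm shutdown to a cold one (kill -9).
containers:
  - name: worker
    command: ["dumb-init", "--", "airflow", "celery", "worker"]
    lifecycle:
      preStop:
        exec:
          command: ["bash", "-c", "pkill -TERM -f 'celery worker' || true"]
```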