
Graceful termination does not work with apache chart

See original GitHub issue

In apache/airflow, the helm chart has a default of terminationGracePeriodSeconds: 600 for the worker.
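
For reference, this is roughly where that default lives in the chart's values (a minimal sketch; the exact key path may differ between chart versions):

workers:
  # how long Kubernetes waits after sending SIGTERM before it SIGKILLs the worker pod
  terminationGracePeriodSeconds: 600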

After deploying with 1.10.14, I observed that the worker was terminated immediately. This reproduced consistently.

I also tested with 2.0.0 and again had no luck.

Anyone have any hints on what to look into?

Here are some logs from a worker that shut down ungracefully, running 1.10.14:

worker: Warm shutdown (MainProcess)
[2021-01-09 22:24:30,747: ERROR/MainProcess] Process 'ForkPoolWorker-15' pid:37 exited with 'signal 15 (SIGTERM)'
[2021-01-09 22:24:30,858: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: signal 15 (SIGTERM).')
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.7/site-packages/celery/worker/worker.py", line 208, in start
    self.blueprint.start(self)
  File "/home/airflow/.local/lib/python3.7/site-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/home/airflow/.local/lib/python3.7/site-packages/celery/bootsteps.py", line 369, in start
    return self.obj.start()
  File "/home/airflow/.local/lib/python3.7/site-packages/celery/worker/consumer/consumer.py", line 318, in start
    blueprint.start(self)
  File "/home/airflow/.local/lib/python3.7/site-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/home/airflow/.local/lib/python3.7/site-packages/celery/worker/consumer/consumer.py", line 599, in start
    c.loop(*c.loop_args())
  File "/home/airflow/.local/lib/python3.7/site-packages/celery/worker/loops.py", line 83, in asynloop
    next(loop)
  File "/home/airflow/.local/lib/python3.7/site-packages/kombu/asynchronous/hub.py", line 308, in create_loop
    events = poll(poll_timeout)
  File "/home/airflow/.local/lib/python3.7/site-packages/kombu/utils/eventio.py", line 84, in poll
    return self._epoll.poll(timeout if timeout is not None else -1)
  File "/home/airflow/.local/lib/python3.7/site-packages/celery/apps/worker.py", line 285, in _handle_request
    raise exc(exitcode)
celery.exceptions.WorkerShutdown: 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.7/site-packages/billiard/pool.py", line 1267, in mark_as_worker_lost
    human_status(exitcode)),
billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 15 (SIGTERM).
[2021-01-09 22:24:30,865: ERROR/MainProcess] Process 'ForkPoolWorker-16' pid:38 exited with 'signal 15 (SIGTERM)'

 -------------- celery@airflow-worker-66b7bf687b-8j2x5 v4.4.7 (cliffs)
--- ***** -----
-- ******* ---- Linux-4.14.209-160.335.amzn2.x86_64-x86_64-with-debian-10.6 2021-01-09 22:22:38
- *** --- * ---
- ** ---------- [config]
- ** ---------- .> app:         airflow.executors.celery_executor:0x7fc31160fd90
- ** ---------- .> transport:   redis://:**@airflow-redis:6379/0
- ** ---------- .> results:     postgresql://postgres:**@airflow-pgbouncer:6543/airflow-result-backend
- *** --- * --- .> concurrency: 16 (prefork)
-- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
--- ***** -----
 -------------- [queues]
                .> celery           exchange=celery(direct) key=celery


[tasks]
  . airflow.executors.celery_executor.execute_command

And again with 2.0.0:

[2021-01-10 06:46:21,159: INFO/MainProcess] Connected to redis://:**@airflow-redis:6379/0
[2021-01-10 06:46:21,168: INFO/MainProcess] mingle: searching for neighbors
[2021-01-10 06:46:22,208: INFO/MainProcess] mingle: all alone
[2021-01-10 06:46:22,224: INFO/MainProcess] celery@airflow-worker-66b9b6495b-6m7jd ready.
[2021-01-10 06:46:25,199: INFO/MainProcess] Events of group {task} enabled by remote.
[2021-01-10 06:47:47,441: INFO/MainProcess] Received task: airflow.executors.celery_executor.execute_command[dab010cc-72fb-4f73-8c53-05b46ca71848]
[2021-01-10 06:47:47,503: INFO/ForkPoolWorker-7] Executing command in Celery: ['airflow', 'tasks', 'run', 'standish_test_dag', 'test-secrets-backend', '2021-01-10T06:44:35.573723+00:00', '--local', '--pool', 'default_pool', '--subdir', '/opt/airflow/dags/standish_test.py']
[2021-01-10 06:47:47,549: INFO/ForkPoolWorker-7] Filling up the DagBag from /opt/airflow/dags/standish_test.py
[2021-01-10 06:47:47,830: INFO/ForkPoolWorker-7] Loading 1 plugin(s) took 0.26 seconds
[2021-01-10 06:47:47,845: WARNING/ForkPoolWorker-7] Running <TaskInstance: standish_test_dag.test-secrets-backend 2021-01-10T06:44:35.573723+00:00 [queued]> on host 10.5.21.64
[2021-01-10 06:48:17,735] {_internal.py:113} INFO - 10.5.22.61 - - [10/Jan/2021 06:48:17] "GET /log/standish_test_dag/test-secrets-backend/2021-01-10T06:44:35.573723+00:00/1.log HTTP/1.1" 404 -
[2021-01-10 06:48:17,738] {_internal.py:113} INFO - 10.5.22.61 - - [10/Jan/2021 06:48:17] "GET /log/standish_test_dag/test-secrets-backend/2021-01-10T06:44:35.573723+00:00/2.log HTTP/1.1" 200 -

With 2.0.0 there's no error, but it still terminates immediately without respecting the grace period.

I tried various combinations of args and saw the same behavior every time (a sketch of how these were passed to the chart follows the list):

  • ["bash", "-c", "airflow worker"]
  • ["bash", "-c", "exec airflow worker"]
  • ["worker"]

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 10 (9 by maintainers)

Top GitHub Comments

1 reaction
dstandish commented, Jan 11, 2021

> I will take a look later this week. It also depends which command is used to run airflow components. You are talking about the current master version of the ‘chart’, yeah? No modification to the entrypoint or command?

Correct, no mods to the entrypoint. You can see which things I tried in the helm config above – different values of args or command.

> So if you expect the worker to terminate immediately, you might have actually observed wrong behaviour where someone sent more than one SIGTERM to those workers (I’ve seen such setups) - but this is a rather bad idea IMHO

No, I do not want the worker to terminate immediately. I want it to do what it is supposed to do, namely a warm shutdown – i.e. stop taking new tasks and run until either all tasks are done or the grace period has elapsed.

1 reaction
potiuk commented, Jan 11, 2021

I will take a look later this week. It also depends which command is used to run airflow components. You are talking about the current master version of the ‘chart’, yeah? No modification to the entrypoint or command?

Dumb-init and tini are equivalent, and they are indeed there to forward signals to the running processes. This is really useful when you have a bash script as the entrypoint: if you have bash as the direct entrypoint, it will not forward signals to its children. There are two solutions to this:

a) dumb-init or tini as the entrypoint
b) exec ‘binary’ at the end of the bash script (exec-ing another bash won’t work)

The default entrypoint in the prod image is dumb-init, so it should propagate the signals properly. But as @xinbinhuang mentioned, the celery worker has a number of config options: when you send a SIGTERM to the celery worker, it will stop spawning new processes and wait for all the running tasks to terminate, so by definition the worker might take quite some time to exit. The termination grace period controls how long celery gets to wait for all processes to terminate before they are ‘kill -9’-ed and it exits non-gracefully.
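
Putting those pieces together, here is a minimal sketch of a worker pod spec that should propagate SIGTERM and respect the grace period (the image tag and command layout are illustrative, not copied from the chart's rendered output):

spec:
  # Kubernetes sends SIGTERM, waits this long, then SIGKILLs whatever is still running
  terminationGracePeriodSeconds: 600
  containers:
    - name: worker
      image: apache/airflow:2.0.0                           # illustrative tag
      # option (a): dumb-init as PID 1 forwards SIGTERM to its child process
      command: ["dumb-init", "--"]
      # option (b): if a bash wrapper is used, exec the final binary so it replaces
      # bash and receives the forwarded signal directly (‘airflow worker’ on 1.10.x)
      args: ["bash", "-c", "exec airflow celery worker"]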

Also, there is another gotcha: if you send a SECOND SIGTERM to such a celery worker while it is waiting for tasks, it will terminate all the processes with ‘kill -9’ and will exit immediately.

So if you expect the worker to terminate immediately, you might have actually observed wrong behaviour where someone sent more than one SIGTERM to those workers (I’ve seen such setups) - but this is a rather bad idea IMHO.

Read more comments on GitHub >

Top Results From Across the Web

Spark Streaming Graceful Shutdown - Stack Overflow
I would like to add that to accomplish graceful shutdown when running on yarn, one must provide a mechanism...
Read more >
Stop Apache gracefully - Server Fault
A “special” option to gracefully stop is only needed if you run a custom process manager that normally kills processes.
Read more >
Graceful shutdown in Kubernetes - HackerNoon
I've been doing a research on how to do graceful shutdown of HTTP services in Kubernetes. Surprisingly, I found contradictory opinions on ...
Read more >
Graceful Shutdown - Apache Camel
It's responsible for shutting down routes in a graceful manner. The other resources will still be handled by CamelContext to shutdown. This leaves...
Read more >
View the list of available chart parameters - VMware Docs
diagnosticMode.enabled, Enable diagnostic mode (all probes will be ... Seconds Airflow scheduler pod needs to terminate gracefully, "".
Read more >
