Graceful termination does not work with the Apache Airflow Helm chart
In apache/airflow, the Helm chart has a worker default of terminationGracePeriodSeconds: 600.
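For reference, this is roughly what that default looks like in the values file (a sketch only; the exact key path may vary between chart versions):

```yaml
# Sketch of the relevant chart values, assuming a workers.* key layout.
workers:
  # How long Kubernetes waits after sending SIGTERM before it sends SIGKILL.
  terminationGracePeriodSeconds: 600
```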
After deploying with 1.10.14, I observed that the worker was terminated immediately. This reproduced consistently.
I also tested with 2.0.0 and again had no luck.
Does anyone have hints on what to look into?
Here are some logs from a worker that shutdown ungracefully, running 1.10.14:
worker: Warm shutdown (MainProcess)
[2021-01-09 22:24:30,747: ERROR/MainProcess] Process 'ForkPoolWorker-15' pid:37 exited with 'signal 15 (SIGTERM)'
[2021-01-09 22:24:30,858: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: signal 15 (SIGTERM).')
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.7/site-packages/celery/worker/worker.py", line 208, in start
self.blueprint.start(self)
File "/home/airflow/.local/lib/python3.7/site-packages/celery/bootsteps.py", line 119, in start
step.start(parent)
File "/home/airflow/.local/lib/python3.7/site-packages/celery/bootsteps.py", line 369, in start
return self.obj.start()
File "/home/airflow/.local/lib/python3.7/site-packages/celery/worker/consumer/consumer.py", line 318, in start
blueprint.start(self)
File "/home/airflow/.local/lib/python3.7/site-packages/celery/bootsteps.py", line 119, in start
step.start(parent)
File "/home/airflow/.local/lib/python3.7/site-packages/celery/worker/consumer/consumer.py", line 599, in start
c.loop(*c.loop_args())
File "/home/airflow/.local/lib/python3.7/site-packages/celery/worker/loops.py", line 83, in asynloop
next(loop)
File "/home/airflow/.local/lib/python3.7/site-packages/kombu/asynchronous/hub.py", line 308, in create_loop
events = poll(poll_timeout)
File "/home/airflow/.local/lib/python3.7/site-packages/kombu/utils/eventio.py", line 84, in poll
return self._epoll.poll(timeout if timeout is not None else -1)
File "/home/airflow/.local/lib/python3.7/site-packages/celery/apps/worker.py", line 285, in _handle_request
raise exc(exitcode)
celery.exceptions.WorkerShutdown: 0
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.7/site-packages/billiard/pool.py", line 1267, in mark_as_worker_lost
human_status(exitcode)),
billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 15 (SIGTERM).
[2021-01-09 22:24:30,865: ERROR/MainProcess] Process 'ForkPoolWorker-16' pid:38 exited with 'signal 15 (SIGTERM)'
-------------- celery@airflow-worker-66b7bf687b-8j2x5 v4.4.7 (cliffs)
--- ***** -----
-- ******* ---- Linux-4.14.209-160.335.amzn2.x86_64-x86_64-with-debian-10.6 2021-01-09 22:22:38
- *** --- * ---
- ** ---------- [config]
- ** ---------- .> app: airflow.executors.celery_executor:0x7fc31160fd90
- ** ---------- .> transport: redis://:**@airflow-redis:6379/0
- ** ---------- .> results: postgresql://postgres:**@airflow-pgbouncer:6543/airflow-result-backend
- *** --- * --- .> concurrency: 16 (prefork)
-- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
--- ***** -----
-------------- [queues]
.> celery exchange=celery(direct) key=celery
[tasks]
. airflow.executors.celery_executor.execute_command
And again with 2.0.0:
[2021-01-10 06:46:21,159: INFO/MainProcess] Connected to redis://:**@airflow-redis:6379/0
[2021-01-10 06:46:21,168: INFO/MainProcess] mingle: searching for neighbors
[2021-01-10 06:46:22,208: INFO/MainProcess] mingle: all alone
[2021-01-10 06:46:22,224: INFO/MainProcess] celery@airflow-worker-66b9b6495b-6m7jd ready.
[2021-01-10 06:46:25,199: INFO/MainProcess] Events of group {task} enabled by remote.
[2021-01-10 06:47:47,441: INFO/MainProcess] Received task: airflow.executors.celery_executor.execute_command[dab010cc-72fb-4f73-8c53-05b46ca71848]
[2021-01-10 06:47:47,503: INFO/ForkPoolWorker-7] Executing command in Celery: ['airflow', 'tasks', 'run', 'standish_test_dag', 'test-secrets-backend', '2021-01-10T06:44:35.573723+00:00', '--local', '--pool', 'default_pool', '--subdir', '/opt/airflow/dags/standish_test.py']
[2021-01-10 06:47:47,549: INFO/ForkPoolWorker-7] Filling up the DagBag from /opt/airflow/dags/standish_test.py
[2021-01-10 06:47:47,830: INFO/ForkPoolWorker-7] Loading 1 plugin(s) took 0.26 seconds
[2021-01-10 06:47:47,845: WARNING/ForkPoolWorker-7] Running <TaskInstance: standish_test_dag.test-secrets-backend 2021-01-10T06:44:35.573723+00:00 [queued]> on host 10.5.21.64
[2021-01-10 06:48:17,735] {_internal.py:113} INFO - 10.5.22.61 - - [10/Jan/2021 06:48:17] "GET /log/standish_test_dag/test-secrets-backend/2021-01-10T06:44:35.573723+00:00/1.log HTTP/1.1" 404 -
[2021-01-10 06:48:17,738] {_internal.py:113} INFO - 10.5.22.61 - - [10/Jan/2021 06:48:17] "GET /log/standish_test_dag/test-secrets-backend/2021-01-10T06:44:35.573723+00:00/2.log HTTP/1.1" 200 -
With 2.0.0 there's no error, but termination is still immediate, without respecting the grace period.
I tried various combinations of args and saw the same behavior every time (values sketch below):
["bash", "-c", "airflow worker"]
["bash", "-c", "exec airflow worker"]
["worker"]
Comments (10, 9 by maintainers):
Correct, no mods to the entrypoint. You can see which things I tried in the Helm config above: different values of args or command.
No, I do not want the worker to terminate immediately. I want it to do what it is supposed to do, namely a warm shutdown: stop taking new tasks and keep running until either all tasks are done or the grace period has elapsed.
I will take a look later this week. It also depends on which command is used to run the Airflow components. You are talking about the current master version of the chart, yeah? No modification to the entrypoint or command?
Dumb-init and tini are equivalent, and they are indeed there to forward signals to the running processes. This is really useful when you have a bash script as the entrypoint: if you have bash as the direct entrypoint, it will not forward signals to its children. There are two solutions (sketched below):
a) dumb-init or tini as the entrypoint
b) exec 'binary' at the end of the bash script (exec-ing another bash won't work)
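As a sketch of what that looks like at the container level (illustrative pod-spec fragments, not the chart's exact output; the Airflow 2 worker command is airflow celery worker, in 1.10 it was airflow worker):

```yaml
# (a) dumb-init (or tini) as PID 1, forwarding SIGTERM to the worker it spawns.
containers:
  - name: worker
    command: ["dumb-init", "--"]
    args: ["airflow", "celery", "worker"]
    # (b) alternatively, a bash wrapper that exec's the worker, so the worker
    #     process replaces bash and receives the forwarded SIGTERM directly:
    # command: ["bash", "-c", "exec airflow celery worker"]
```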
The default entrypoint in the prod image is dumb-init, so it should propagate the signals properly. But as @xinbinhuang mentioned, the celery worker has a number of config options: when you send it a SIGTERM, the celery worker will stop spawning new processes and wait for all the running tasks to terminate, so by definition the worker might take quite some time to exit. The termination grace period controls how long celery gets to wait for all processes to finish before everything left is 'kill -9'-ed and the worker exits non-gracefully.
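Putting that together, the signal path for a worker pod looks roughly like this (an illustrative fragment, not the exact rendered chart output):

```yaml
# Kubernetes sends SIGTERM to PID 1 (dumb-init), which forwards it to the celery
# worker; celery then does a warm shutdown and waits for running tasks. When
# terminationGracePeriodSeconds elapses, Kubernetes sends SIGKILL to whatever is
# left, which is the "kill -9", non-graceful exit described above.
spec:
  terminationGracePeriodSeconds: 600     # should cover your longest-running task
  containers:
    - name: worker
      image: apache/airflow:2.0.0
      args: ["celery", "worker"]         # illustrative; the real args come from the chart
```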
Also, there is another gotcha: if you send a SECOND SIGTERM to such a celery worker while it is waiting for tasks, it will terminate all the processes with 'kill -9' and exit immediately.
So if you see the worker terminate immediately, you might actually be observing wrong behaviour where someone sent more than one SIGTERM to those workers (I've seen such setups), but this is a rather bad idea IMHO.
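Just to illustrate how a worker can end up receiving two SIGTERMs, here is a hypothetical anti-pattern (the preStop hook below is made up for illustration, not something the chart ships):

```yaml
# Anti-pattern sketch: the preStop hook already SIGTERMs the worker, and once the
# hook finishes Kubernetes sends its own SIGTERM to PID 1, which dumb-init forwards.
# Celery sees two SIGTERMs and escalates the warm shutdown to a cold one (kill -9).
containers:
  - name: worker
    command: ["dumb-init", "--", "airflow", "celery", "worker"]
    lifecycle:
      preStop:
        exec:
          command: ["bash", "-c", "pkill -TERM -f 'celery worker' || true"]
```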