
celery_executor becomes stuck if child process receives signal before reset_signals is called

See original GitHub issue

Apache Airflow version: 1.10.13 onwards (Any version that picked up #11278, including Airflow 2.0.* and 2.1.*)

Environment:

  • Cloud provider or hardware configuration: Any
  • OS (e.g. from /etc/os-release): Only tested on Debian Linux, but others may be affected too
  • Kernel (e.g. uname -a): Any
  • Install tools: Any
  • Others: Only celery_executor is affected

What happened: This was first reported here. airflow-scheduler sometimes stops heartbeating and stops scheduling any tasks, with the following as the last line in the log. This happens at random times, a few times a week, and more often if the scheduler machine is slow.

{scheduler_job.py:746} INFO - Exiting gracefully upon receiving signal 15

The problem is that when the machine is slow, reset_signals() in one or more slow child processes has not yet run by the time other child processes send SIGTERM as they exit. As a result, the slow child processes respond to the SIGTERM as if they were the main scheduler process, which is why Exiting gracefully upon receiving signal 15 appears in the scheduler log. Since the probability of this happening is very low, the issue is very difficult to reproduce reliably in production.
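
To make the inheritance part concrete, here is a minimal standalone sketch (not Airflow code) showing that a forked child keeps the parent's SIGTERM handler until it installs its own, so a signal delivered in that window runs the parent's handler inside the child:

import os
import signal
import time


def parent_handler(signum, frame):
    print(f"{os.getpid()} handling signal {signum} as if it were the scheduler")


signal.signal(signal.SIGTERM, parent_handler)

pid = os.fork()
if pid == 0:
    # Child: the handler inherited from the parent is still installed.
    assert signal.getsignal(signal.SIGTERM) is parent_handler
    time.sleep(1)  # window during which a SIGTERM runs the inherited handler
    signal.signal(signal.SIGTERM, signal.SIG_DFL)  # the equivalent of reset_signals()
    os._exit(0)
else:
    os.kill(pid, signal.SIGTERM)  # delivered before the child resets its handlers
    os.waitpid(pid, 0)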

Related to #7935. Most likely caused by #11278.

What you expected to happen: Scheduler should not become stuck

How to reproduce it:

Here’s a small example that reproduces the problem. There’s roughly a 1-in-25 chance that it gets stuck, so run it many times to see it happen.

#!/usr/bin/env python3.8
import os
import random
import signal
import time
from multiprocessing import Pool


def send_task_to_executor(arg):
    pass


def _exit_gracefully(signum, frame):
    # Mimics the scheduler's handler; forked children inherit it until reset_signals() runs.
    print(f"{os.getpid()} Exiting gracefully upon receiving signal {signum}")


def register_signals():
    print(f"{os.getpid()} register_signals()")
    signal.signal(signal.SIGINT, _exit_gracefully)
    signal.signal(signal.SIGTERM, _exit_gracefully)
    signal.signal(signal.SIGUSR2, _exit_gracefully)


def reset_signals():
    # Pool initializer: each worker restores the default handlers. A slow worker
    # can receive SIGTERM before this runs and reacts with the inherited
    # _exit_gracefully handler instead of exiting.
    if random.randint(0, 500) == 0:
        # This sleep simulates the machine being busy
        print(f"{os.getpid()} is slow")
        time.sleep(0.1)
    signal.signal(signal.SIGINT, signal.SIG_DFL)
    signal.signal(signal.SIGTERM, signal.SIG_DFL)
    signal.signal(signal.SIGUSR2, signal.SIG_DFL)


if __name__ == "__main__":
    register_signals()

    task_tuples_to_send = list(range(20))
    sync_parallelism = 15
    chunksize = 5

    # Leaving the "with" block calls Pool.terminate(), which sends SIGTERM to the workers.
    with Pool(processes=sync_parallelism, initializer=reset_signals) as pool:
        pool.map(
            send_task_to_executor,
            task_tuples_to_send,
            chunksize=chunksize,
        )
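
Because the hang only occurs in a small fraction of runs, a small driver script like the following can automate the repeated runs. This is just a convenience sketch, and repro.py is a placeholder for wherever the example above is saved:

#!/usr/bin/env python3.8
# Hypothetical helper, not part of Airflow: run the repro in a loop and treat
# any run that does not finish within the timeout as a hang. A hung run may
# leave stuck pool workers behind that need to be cleaned up manually.
import subprocess
import sys

RUNS = 200
TIMEOUT = 30  # seconds; a healthy run finishes almost instantly

for i in range(RUNS):
    try:
        subprocess.run([sys.executable, "repro.py"], timeout=TIMEOUT, check=True)
    except subprocess.TimeoutExpired:
        print(f"run {i} hung (no exit within {TIMEOUT}s)")
        break
else:
    print(f"no hang observed in {RUNS} runs")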


When the reproducing example above becomes stuck, its py-spy dump looks exactly like that of a stuck airflow scheduler:

py-spy dump for the parent airflow scheduler process

Python v3.8.7

Thread 0x7FB54794E740 (active): "MainThread"
    poll (multiprocessing/popen_fork.py:27)
    wait (multiprocessing/popen_fork.py:47)
    join (multiprocessing/process.py:149)
    _terminate_pool (multiprocessing/pool.py:729)
    __call__ (multiprocessing/util.py:224)
    terminate (multiprocessing/pool.py:654)
    __exit__ (multiprocessing/pool.py:736)
    _send_tasks_to_celery (airflow/executors/celery_executor.py:331)
    _process_tasks (airflow/executors/celery_executor.py:272)
    trigger_tasks (airflow/executors/celery_executor.py:263)
    heartbeat (airflow/executors/base_executor.py:158)
    _run_scheduler_loop (airflow/jobs/scheduler_job.py:1388)
    _execute (airflow/jobs/scheduler_job.py:1284)
    run (airflow/jobs/base_job.py:237)
    scheduler (airflow/cli/commands/scheduler_command.py:63)
    wrapper (airflow/utils/cli.py:89)
    command (airflow/cli/cli_parser.py:48)
    main (airflow/__main__.py:40)
    <module> (airflow:8)

py-spy dump for the child airflow scheduler process

Python v3.8.7

Thread 16232 (idle): "MainThread"
    __enter__ (multiprocessing/synchronize.py:95)
    get (multiprocessing/queues.py:355)
    worker (multiprocessing/pool.py:114)
    run (multiprocessing/process.py:108)
    _bootstrap (multiprocessing/process.py:315)
    _launch (multiprocessing/popen_fork.py:75)
    __init__ (multiprocessing/popen_fork.py:19)
    _Popen (multiprocessing/context.py:277)
    start (multiprocessing/process.py:121)
    _repopulate_pool_static (multiprocessing/pool.py:326)
    _repopulate_pool (multiprocessing/pool.py:303)
    __init__ (multiprocessing/pool.py:212)
    Pool (multiprocessing/context.py:119)
    _send_tasks_to_celery (airflow/executors/celery_executor.py:330)
    _process_tasks (airflow/executors/celery_executor.py:272)
    trigger_tasks (airflow/executors/celery_executor.py:263)
    heartbeat (airflow/executors/base_executor.py:158)
    _run_scheduler_loop (airflow/jobs/scheduler_job.py:1388)
    _execute (airflow/jobs/scheduler_job.py:1284)
    run (airflow/jobs/base_job.py:237)
    scheduler (airflow/cli/commands/scheduler_command.py:63)
    wrapper (airflow/utils/cli.py:89)
    command (airflow/cli/cli_parser.py:48)
    main (airflow/__main__.py:40)
    <module> (airflow:8)

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 6
  • Comments: 16 (11 by maintainers)
Top GitHub Comments

3 reactions
yuqian90 commented, May 21, 2021

Just how slow does it have to be to happen? We can probably guard this by closing over the current pid when we register them, and checking that the signal is received by the same pid

Hi @ashb, it’s not clear to me exactly how slow it must be for this to happen. It looks like as long as some child processes are a fraction of a second slower than the others, they can easily end up in a deadlock when a SIGTERM is received. So even transient slowness on a beefy machine can cause this to happen.
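
For reference, here is a minimal sketch of the pid guard @ashb describes: remember which pid registered the handlers and have the handler do nothing in any other process. The names are hypothetical and this is not the change that was eventually merged; as the list below notes, making the handler a no-op in child processes was not, on its own, enough to prevent the hang:

import os
import signal

_signal_owner_pid = None  # pid that installed the handlers (hypothetical guard)


def _exit_gracefully(signum, frame):
    # Forked children that inherited this handler before calling reset_signals()
    # ignore the signal; only the registering process exits gracefully.
    if os.getpid() != _signal_owner_pid:
        return
    print(f"{os.getpid()} Exiting gracefully upon receiving signal {signum}")


def register_signals():
    global _signal_owner_pid
    _signal_owner_pid = os.getpid()
    signal.signal(signal.SIGINT, _exit_gracefully)
    signal.signal(signal.SIGTERM, _exit_gracefully)
    signal.signal(signal.SIGUSR2, _exit_gracefully)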

Here’s what I tried so far. Only the last method seems to fix the issue completely (i.e. we have to stop using multiprocessing.Pool):

  • Tried to reset the signal handler to signal.SIG_DFL in register_signals if the current process is a child process. This doesn’t help because the child process inherits the parent’s signal handler when it’s forked. Still hangs occasionally.
  • Tried to make _exit_gracefully a no-op if the current process is a child process. This isn’t sufficient. Still hangs occasionally.
  • Tried changing multiprocessing to use “spawn” instead of “fork”, as some people on the internet suggested. This greatly reduced the chance of the issue happening; however, after running the reproducing example about 8000 times, it still happened, so it doesn’t fix the issue completely.
  • Replaced multiprocessing.Pool with concurrent.futures.process.ProcessPoolExecutor. Once this is done, the reproducing example no longer hangs even after running it tens of thousands of times. So I put up PR #15989, which fixes the issue using this method.

From experience, multiprocessing.Pool is notorious for causing mysterious hangs like these. Using ProcessPoolExecutor does not cause the same problems, even though it has a similar interface and uses similar underlying libraries. I don’t understand exactly why it fixes the issue, but in practice it always seems to help.
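
As an illustration of that last approach, here is the reproducing example with multiprocessing.Pool swapped for ProcessPoolExecutor. This is only a sketch of the idea, not the actual diff in PR #15989:

#!/usr/bin/env python3.8
# Sketch only: same workload as the repro above, but using
# concurrent.futures.ProcessPoolExecutor instead of multiprocessing.Pool.
import signal
from concurrent.futures import ProcessPoolExecutor


def send_task_to_executor(arg):
    pass


def reset_signals():
    # Same worker initializer as in the repro above.
    signal.signal(signal.SIGINT, signal.SIG_DFL)
    signal.signal(signal.SIGTERM, signal.SIG_DFL)
    signal.signal(signal.SIGUSR2, signal.SIG_DFL)


if __name__ == "__main__":
    task_tuples_to_send = list(range(20))
    sync_parallelism = 15
    chunksize = 5

    with ProcessPoolExecutor(
        max_workers=sync_parallelism, initializer=reset_signals
    ) as executor:
        # executor.map is lazy; list() forces all tasks to run to completion.
        list(
            executor.map(
                send_task_to_executor,
                task_tuples_to_send,
                chunksize=chunksize,
            )
        )

One plausible reason this helps is that ProcessPoolExecutor's shutdown path wakes workers with sentinel items on its call queue and joins them rather than SIGTERM-ing them the way Pool.terminate() does, though, as noted above, the exact mechanism was not pinned down in this issue.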

0 reactions
potiuk commented, Dec 19, 2022

@zsdyx Can you please open a new issue for it with all the details in it: exactly what happens for you, what the sequence of events is, which version, what deployment, and all the logs?

It’s extremely hard to tell whether the problem you have is related to this closed issue. There are no easy reproduction steps because we have no idea what you did to get into that state, or how closely it matches the other, already fixed issue. Please spend a bit more time explaining your situation in detail - help us help you solve your problem.


