Forked processes aren't logged by tasks using the Python Operator

Apache Airflow version: 2.0.0

Environment:

  • Cloud provider or hardware configuration: 2 GHz quad-core i5
  • OS (e.g. from /etc/os-release): macOS Big Sur 11.2.3
  • Kernel (e.g. uname -a): 20.3.0 Darwin Kernel Version
  • Container OS: Debian 10.8 4.19.121-linuxkit
  • Install tools: Docker compose

What happened: When the PythonOperator runs a subprocess, the output of the forked process is never written to the task logs.

[2021-03-15 20:18:12,844] {taskinstance.py:1063} INFO - Executing <Task(PythonOperator): sub> on 2021-03-15T20:13:45.710628+00:00
[2021-03-15 20:18:12,849] {standard_task_runner.py:52} INFO - Started process 400 to run task
[2021-03-15 20:18:12,853] {standard_task_runner.py:76} INFO - Running: ['airflow', 'tasks', 'run', 'sub_processing', 'sub', '2021-03-15T20:13:45.710628+00:00', '--job-id', '20', '--pool', 'default_pool', '--raw', '--subdir', 'DAGS_FOLDER/sub-process-2.py', '--cfg-path', '/tmp/tmpmspwuqh1', '--error-file', '/tmp/tmp75f6t8bp']
[2021-03-15 20:18:12,854] {standard_task_runner.py:77} INFO - Job 20: Subtask sub
[2021-03-15 20:18:12,898] {logging_mixin.py:103} INFO - Running <TaskInstance: sub_processing.sub 2021-03-15T20:13:45.710628+00:00 [running]> on host 5af90102472a
[2021-03-15 20:18:12,937] {taskinstance.py:1256} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_OWNER=airflow
AIRFLOW_CTX_DAG_ID=sub_processing
AIRFLOW_CTX_TASK_ID=sub
AIRFLOW_CTX_EXECUTION_DATE=2021-03-15T20:13:45.710628+00:00
AIRFLOW_CTX_DAG_RUN_ID=manual__2021-03-15T20:13:45.710628+00:00
[2021-03-15 20:18:23,003] {logging_mixin.py:103} INFO - Task Completed With Success
[2021-03-15 20:18:23,004] {python.py:118} INFO - Done. Returned value was: None
[2021-03-15 20:18:23,015] {taskinstance.py:1166} INFO - Marking task as SUCCESS. dag_id=sub_processing, task_id=sub, execution_date=20210315T201345, start_date=20210315T201812, end_date=20210315T201823
[2021-03-15 20:18:23,055] {taskinstance.py:1219} INFO - 0 downstream tasks scheduled from follow-on schedule check
[2021-03-15 20:18:23,069] {local_task_job.py:142} INFO - Task exited with return code 0

What you expected to happen: The subprocess output appears in the task log, as it does when the non-forking base task runner is used (note the timestamped lines from easy.py below):

[2021-03-15 20:43:28,520] {taskinstance.py:1063} INFO - Executing <Task(PythonOperator): sub> on 2021-03-15T20:13:45.710628+00:00
[2021-03-15 20:43:28,520] {base_task_runner.py:133} INFO - Running on host: 49fb1f13cbae
[2021-03-15 20:43:28,521] {base_task_runner.py:134} INFO - Running: ['airflow', 'tasks', 'run', 'sub_processing', 'sub', '2021-03-15T20:13:45.710628+00:00', '--job-id', '28', '--pool', 'default_pool', '--raw', '--subdir', 'DAGS_FOLDER/sub-process-2.py', '--cfg-path', '/tmp/tmpfukw5nds', '--error-file', '/tmp/tmp18xkzu_9']
[2021-03-15 20:43:30,035] {base_task_runner.py:118} INFO - Job 28: Subtask sub [2021-03-15 20:43:30,035] {plugins_manager.py:286} INFO - Loading 2 plugin(s) took 1.00 seconds
[2021-03-15 20:43:30,140] {base_task_runner.py:118} INFO - Job 28: Subtask sub [2021-03-15 20:43:30,139] {dagbag.py:440} INFO - Filling up the DagBag from /usr/local/airflow/dags/sub-process-2.py
[2021-03-15 20:43:30,196] {base_task_runner.py:118} INFO - Job 28: Subtask sub Running <TaskInstance: sub_processing.sub 2021-03-15T20:13:45.710628+00:00 [running]> on host 49fb1f13cbae
[2021-03-15 20:43:30,233] {taskinstance.py:1256} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_OWNER=airflow
AIRFLOW_CTX_DAG_ID=sub_processing
AIRFLOW_CTX_TASK_ID=sub
AIRFLOW_CTX_EXECUTION_DATE=2021-03-15T20:13:45.710628+00:00
AIRFLOW_CTX_DAG_RUN_ID=manual__2021-03-15T20:13:45.710628+00:00
[2021-03-15 20:43:40,289] {base_task_runner.py:118} INFO - Job 28: Subtask sub Mon Mar 15 20:43:30 2021
[2021-03-15 20:43:40,289] {base_task_runner.py:118} INFO - Job 28: Subtask sub Mon Mar 15 20:43:35 2021
[2021-03-15 20:43:40,294] {logging_mixin.py:103} INFO - Task Completed With Success
[2021-03-15 20:43:40,295] {python.py:118} INFO - Done. Returned value was: None
[2021-03-15 20:43:40,312] {taskinstance.py:1166} INFO - Marking task as SUCCESS. dag_id=sub_processing, task_id=sub, execution_date=20210315T201345, start_date=20210315T204328, end_date=20210315T204340
[2021-03-15 20:43:40,339] {taskinstance.py:1219} INFO - 0 downstream tasks scheduled from follow-on schedule check
[2021-03-15 20:43:40,605] {local_task_job.py:142} INFO - Task exited with return code 0

How to reproduce it:

  1. Create a DAG using the PythonOperator that calls another Python script. An example of such a DAG is:

from datetime import datetime
import shlex
import subprocess

from airflow.models import DAG
from airflow.operators.python import PythonOperator

default_args = {
    'start_date': datetime(2021, 1, 1),
    'catchup': False
}

def _sub_process():
    # Run the external script and report success based on its exit code.
    cmd = "python /usr/local/airflow/easy.py"
    ret = subprocess.run(shlex.split(cmd)).returncode
    if ret == 0:
        print("Task Completed With Success")
    else:
        print("Error")

with DAG('sub_processing',
         schedule_interval='@hourly',
         default_args=default_args) as dag:
    # Define tasks/operators
    sub = PythonOperator(
        task_id='sub',
        python_callable=_sub_process
    )
  2. Create a script named easy.py that prints the time:

import time

# Decide pseudo-randomly (based on the current millisecond timestamp)
# whether to run the long loop or the short one.
maybe_fail = round(time.time() * 1000)

if maybe_fail % 5 == 0:
    for count in range(30):
        print(time.ctime())
        time.sleep(5)
else:
    for count in range(2):
        print(time.ctime())
        time.sleep(5)
  3. Run the DAG and view the task logs.

How often does this problem occur? Every time, unless CAN_FORK = False is set in standard_task_runner.py (forcing the non-forking code path).
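
For reference, a minimal sketch of that workaround. The module path matches Airflow 2.0, but patching a module-level constant like this is an assumption of mine, not a supported configuration option:

# Hypothetical workaround: force the slower re-exec path instead of os.fork(),
# so the child's stdout flows through the base task runner and into the task log.
# This must run before any task runner is created, e.g. from a local plugin.
import airflow.task.task_runner.standard_task_runner as standard_task_runner

standard_task_runner.CAN_FORK = False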

cc @kaxil

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 9 (8 by maintainers)

Top GitHub Comments

1 reaction
xinbinhuang commented, Apr 16, 2021

Hi @rbankston, I don’t think this is an Airflow bug; it’s just how subprocess and the logging machinery interact. In short, you need to propagate the stdout of your subprocess back to the parent process in order for it to be logged in Airflow. There are two options:
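
To see why the child’s output disappears, note that the forked task process redirects the Python-level sys.stdout into its log, while a child created with subprocess writes to the OS-level file descriptor 1 and bypasses that redirect. A minimal standalone demonstration (run outside Airflow; the StringIO redirect stands in for Airflow’s log writer):

import io
import subprocess
import sys

buffer = io.StringIO()
sys.stdout = buffer                       # Python-level redirect, like the task log writer

print("captured")                         # intercepted by the redirect
subprocess.run(["echo", "not captured"])  # child inherits and writes to OS-level fd 1

sys.stdout = sys.__stdout__
print("redirect saw only:", repr(buffer.getvalue()))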

  1. Propagate the stdout yourself, i.e.:
    ...
    # capture_output=True collects the child's stdout/stderr, and text=True
    # decodes them to str so the output can be printed (and therefore logged).
    p = subprocess.run(shlex.split(cmd), capture_output=True, text=True)
    print(p.stdout)
    ret = p.returncode
    ...

In this case, p.stdout is a single string holding the whole output, and you may need to parse it with some extra logic.
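
For example, one illustrative way (my sketch, not part of the original comment) to turn that single string back into per-line log records:

    for line in p.stdout.splitlines():
        print(line)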

  2. Recommended: Use the SubprocessHook. This will automatically capture the stdout and log it in Airflow for you:

from airflow.hooks.subprocess import SubprocessHook, SubprocessResult

sub_result: SubprocessResult = SubprocessHook().run_command(shlex.split(cmd))

print(sub_result.exit_code)

# this is only necessary if you want to do something with the output
# beyond having it logged in Airflow
print(sub_result.output)
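
Putting that together, a sketch of the reproduction’s callable rewritten to use the hook (assuming the same command and DAG as above):

import shlex

from airflow.hooks.subprocess import SubprocessHook

def _sub_process():
    cmd = "python /usr/local/airflow/easy.py"
    # run_command forwards the child's output to the task log as it runs.
    result = SubprocessHook().run_command(shlex.split(cmd))
    if result.exit_code == 0:
        print("Task Completed With Success")
    else:
        print("Error")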

cc: @dstandish

0 reactions
raphaelauv commented, Nov 23, 2021

@kaxil it looks like you can remove this issue from the 2.3.0 milestone: https://github.com/apache/airflow/milestone/36?closed=1
