Forked processes aren't logged by tasks using the Python Operator
Apache Airflow version: 2.0.0
Environment:
- Cloud provider or hardware configuration: 2Ghz Quad Core i5
- OS (e.g. from /etc/os-release): macOS Big Sur 11.2.3
- Kernel (e.g. `uname -a`): 20.3.0 Darwin Kernel Version - Container OS: Debian 10.8 4.19.121-linuxkit
- Install tools: Docker compose
What happened: When the PythonOperator runs a subprocess, the output of the forked process never appears in the task logs:
[2021-03-15 20:18:12,844] {taskinstance.py:1063} INFO - Executing <Task(PythonOperator): sub> on 2021-03-15T20:13:45.710628+00:00
[2021-03-15 20:18:12,849] {standard_task_runner.py:52} INFO - Started process 400 to run task
[2021-03-15 20:18:12,853] {standard_task_runner.py:76} INFO - Running: ['airflow', 'tasks', 'run', 'sub_processing', 'sub', '2021-03-15T20:13:45.710628+00:00', '--job-id', '20', '--pool', 'default_pool', '--raw', '--subdir', 'DAGS_FOLDER/sub-process-2.py', '--cfg-path', '/tmp/tmpmspwuqh1', '--error-file', '/tmp/tmp75f6t8bp']
[2021-03-15 20:18:12,854] {standard_task_runner.py:77} INFO - Job 20: Subtask sub
[2021-03-15 20:18:12,898] {logging_mixin.py:103} INFO - Running <TaskInstance: sub_processing.sub 2021-03-15T20:13:45.710628+00:00 [running]> on host 5af90102472a
[2021-03-15 20:18:12,937] {taskinstance.py:1256} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_OWNER=airflow
AIRFLOW_CTX_DAG_ID=sub_processing
AIRFLOW_CTX_TASK_ID=sub
AIRFLOW_CTX_EXECUTION_DATE=2021-03-15T20:13:45.710628+00:00
AIRFLOW_CTX_DAG_RUN_ID=manual__2021-03-15T20:13:45.710628+00:00
[2021-03-15 20:18:23,003] {logging_mixin.py:103} INFO - Task Completed With Success
[2021-03-15 20:18:23,004] {python.py:118} INFO - Done. Returned value was: None
[2021-03-15 20:18:23,015] {taskinstance.py:1166} INFO - Marking task as SUCCESS. dag_id=sub_processing, task_id=sub, execution_date=20210315T201345, start_date=20210315T201812, end_date=20210315T201823
[2021-03-15 20:18:23,055] {taskinstance.py:1219} INFO - 0 downstream tasks scheduled from follow-on schedule check
[2021-03-15 20:18:23,069] {local_task_job.py:142} INFO - Task exited with return code 0
What you expected to happen: The subprocess output is forwarded to the task log, e.g.:
[2021-03-15 20:43:28,520] {taskinstance.py:1063} INFO - Executing <Task(PythonOperator): sub> on 2021-03-15T20:13:45.710628+00:00
[2021-03-15 20:43:28,520] {base_task_runner.py:133} INFO - Running on host: 49fb1f13cbae
[2021-03-15 20:43:28,521] {base_task_runner.py:134} INFO - Running: ['airflow', 'tasks', 'run', 'sub_processing', 'sub', '2021-03-15T20:13:45.710628+00:00', '--job-id', '28', '--pool', 'default_pool', '--raw', '--subdir', 'DAGS_FOLDER/sub-process-2.py', '--cfg-path', '/tmp/tmpfukw5nds', '--error-file', '/tmp/tmp18xkzu_9']
[2021-03-15 20:43:30,035] {base_task_runner.py:118} INFO - Job 28: Subtask sub [2021-03-15 20:43:30,035] {plugins_manager.py:286} INFO - Loading 2 plugin(s) took 1.00 seconds
[2021-03-15 20:43:30,140] {base_task_runner.py:118} INFO - Job 28: Subtask sub [2021-03-15 20:43:30,139] {dagbag.py:440} INFO - Filling up the DagBag from /usr/local/airflow/dags/sub-process-2.py
[2021-03-15 20:43:30,196] {base_task_runner.py:118} INFO - Job 28: Subtask sub Running <TaskInstance: sub_processing.sub 2021-03-15T20:13:45.710628+00:00 [running]> on host 49fb1f13cbae
[2021-03-15 20:43:30,233] {taskinstance.py:1256} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_OWNER=airflow
AIRFLOW_CTX_DAG_ID=sub_processing
AIRFLOW_CTX_TASK_ID=sub
AIRFLOW_CTX_EXECUTION_DATE=2021-03-15T20:13:45.710628+00:00
AIRFLOW_CTX_DAG_RUN_ID=manual__2021-03-15T20:13:45.710628+00:00
[2021-03-15 20:43:40,289] {base_task_runner.py:118} INFO - Job 28: Subtask sub Mon Mar 15 20:43:30 2021
[2021-03-15 20:43:40,289] {base_task_runner.py:118} INFO - Job 28: Subtask sub Mon Mar 15 20:43:35 2021
[2021-03-15 20:43:40,294] {logging_mixin.py:103} INFO - Task Completed With Success
[2021-03-15 20:43:40,295] {python.py:118} INFO - Done. Returned value was: None
[2021-03-15 20:43:40,312] {taskinstance.py:1166} INFO - Marking task as SUCCESS. dag_id=sub_processing, task_id=sub, execution_date=20210315T201345, start_date=20210315T204328, end_date=20210315T204340
[2021-03-15 20:43:40,339] {taskinstance.py:1219} INFO - 0 downstream tasks scheduled from follow-on schedule check
[2021-03-15 20:43:40,605] {local_task_job.py:142} INFO - Task exited with return code 0
How to reproduce it:
- Create a DAG using the PythonOperator that calls another Python script, for example:
```python
from airflow.models import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime
import subprocess
import shlex

default_args = {
    'start_date': datetime(2021, 1, 1),
    'catchup': False
}

def _sub_process():
    cmd = "python /usr/local/airflow/easy.py"
    ret = subprocess.run(shlex.split(cmd)).returncode
    if ret == 0:
        print("Task Completed With Success")
    else:
        print("Error")

with DAG('sub_processing',
         schedule_interval='@hourly',
         default_args=default_args) as dag:

    # Define tasks/operators
    sub = PythonOperator(
        task_id='sub',
        python_callable=_sub_process
    )
```
- Create a script named `easy.py` that prints the current time:
```python
import time
import sys

maybe_fail = round(time.time() * 1000)
if maybe_fail % 5 == 0:
    for count in range(30):
        print(time.ctime())
        time.sleep(5)
else:
    for count in range(2):
        print(time.ctime())
        time.sleep(5)
```
- Run the DAG and view the output of the tasks.
How often does this problem occur? Every time, unless `CAN_FORK = False` is set in `standard_task_runner.py`.
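For reference, the behaviour appears to hinge on the fork-versus-exec decision in the standard task runner, which is what the `CAN_FORK` workaround toggles. The sketch below paraphrases that decision; the method names (`_start_by_fork`, `_start_by_exec`) and the `run_as_user` check are assumptions based on the Airflow 2.0.x sources, not a verbatim excerpt:

```python
import os

# The workaround mentioned above sets this to False.
CAN_FORK = hasattr(os, "fork")

def start_task(runner):
    """Illustrative paraphrase of how the standard task runner picks its path."""
    if CAN_FORK and not runner.run_as_user:
        # Fork path: the task runs in a forked copy of the supervising process,
        # so stdout from any further subprocess is not forwarded to the task log.
        return runner._start_by_fork()
    # Exec path: a separate "airflow tasks run --raw" process is spawned and its
    # output is read back and logged line by line by the base task runner.
    return runner._start_by_exec()
```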
cc @kaxil
Top GitHub Comments
Hi @rbankston, I don't think this is an Airflow bug, but just how the `subprocess` and `logging` modules work. In a word, you need to propagate the `stdout` from your subprocess back in order to log it in Airflow. There are two options:

1. Capture the `stdout` yourself (see the sketches below). In this case, `p.stdout` is a simple string and you may need to parse it with some extra logic.
2. Use the `SubprocessHook`. This will automatically capture the stdout and log it in Airflow for you.

cc: @dstandish
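A minimal sketch of option 1, capturing the output yourself; the script path mirrors the reproduction above, and `capture_output`/`text` require Python 3.7+:

```python
import shlex
import subprocess

def _sub_process():
    cmd = "python /usr/local/airflow/easy.py"
    # Capture the subprocess output instead of letting it go to the forked
    # process's stdout, then print it so the PythonOperator's logger records it.
    p = subprocess.run(shlex.split(cmd), capture_output=True, text=True)
    print(p.stdout)
    if p.returncode == 0:
        print("Task Completed With Success")
    else:
        print("Error")
        print(p.stderr)
```

And a sketch of option 2 using `SubprocessHook` from `airflow.hooks.subprocess` in Airflow 2.x; the `exit_code` field of the returned result is an assumption based on that hook's documented behaviour:

```python
from airflow.hooks.subprocess import SubprocessHook

def _sub_process():
    hook = SubprocessHook()
    # run_command streams the subprocess output through Airflow's task logger.
    result = hook.run_command(command=["python", "/usr/local/airflow/easy.py"])
    if result.exit_code == 0:  # assumed field name on the hook's result object
        print("Task Completed With Success")
    else:
        print("Error")
```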
@kaxil you can apparently remove that issue from 2.3.0 https://github.com/apache/airflow/milestone/36?closed=1