KubernetesExecutor: All task pods are terminating with error while the task succeeds
Apache Airflow version: 2.0.2+
Kubernetes version: 1.20
Helm chart version: 1.0.0
What happened: Pods for successful tasks are terminating with an Error status.
I did further testing with different image versions; my results are below:
- 2.0.1-python3.8 - OK
- 2.0.2-python3.6 - NOK (helm chart default image)
- 2.0.2-python3.8 - NOK
- 2.1.0-python3.8 - NOK
▶ kubectl -n airflow get pods
NAME                                              READY   STATUS      RESTARTS   AGE
airflow-s3-sync-1621853400-x8hzv                  0/1     Completed   0          11s
airflow-scheduler-865c754f55-6fdkt                2/2     Running     0          5m45s
airflow-scheduler-865c754f55-hqbv2                2/2     Running     0          5m45s
airflow-scheduler-865c754f55-hw65l                2/2     Running     0          5m45s
airflow-statsd-84f4f9898-r9xxm                    1/1     Running     0          5m45s
airflow-webserver-7c66d4cd99-28jxv                1/1     Running     0          5m45s
airflow-webserver-7c66d4cd99-d8wrf                1/1     Running     0          5m45s
airflow-webserver-7c66d4cd99-xn2hq                1/1     Running     0          5m45s
simpledagsleep.4862fcd4ec8c4adfb10e421feee88745   0/1     Error       0          2m25s
▶ kubectl -n airflow logs simpledagsleep.4862fcd4ec8c4adfb10e421feee88745
BACKEND=postgresql
DB_HOST=XXXXXXXXXXXXXXXXXXXXXXXX
DB_PORT=5432
[2021-05-24 10:47:57,843] {dagbag.py:451} INFO - Filling up the DagBag from /opt/airflow/dags/simple_dag.py
[2021-05-24 10:47:58,147] {base_aws.py:368} INFO - Airflow Connection: aws_conn_id=aws_default
[2021-05-24 10:47:58,780] {base_aws.py:391} WARNING - Unable to use Airflow Connection for credentials.
[2021-05-24 10:47:58,780] {base_aws.py:392} INFO - Fallback on boto3 credential strategy
[2021-05-24 10:47:58,781] {base_aws.py:395} INFO - Creating session using boto3 credential strategy region_name=eu-central-1
Running <TaskInstance: simple_dag.sleep 2021-05-24T10:47:46.486143+00:00 [queued]> on host simpledagsleep.4862fcd4ec8c4adfb10e421feee88745
Traceback (most recent call last):
  File "/home/airflow/.local/bin/airflow", line 8, in <module>
    sys.exit(main())
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/__main__.py", line 40, in main
    args.func(args)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/cli/cli_parser.py", line 48, in command
    return func(*args, **kwargs)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/utils/cli.py", line 89, in wrapper
    return f(*args, **kwargs)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/cli/commands/task_command.py", line 235, in task_run
    _run_task_by_selected_method(args, dag, ti)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/cli/commands/task_command.py", line 64, in _run_task_by_selected_method
    _run_task_by_local_task_job(args, ti)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/cli/commands/task_command.py", line 120, in _run_task_by_local_task_job
    run_job.run()
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/base_job.py", line 237, in run
    self._execute()
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/local_task_job.py", line 142, in _execute
    self.on_kill()
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/local_task_job.py", line 157, in on_kill
    self.task_runner.on_finish()
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/task/task_runner/base_task_runner.py", line 178, in on_finish
    self._error_file.close()
  File "/usr/local/lib/python3.8/tempfile.py", line 499, in close
    self._closer.close()
  File "/usr/local/lib/python3.8/tempfile.py", line 436, in close
    unlink(self.name)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpt63agqia'
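The final frames point at the task runner's temporary error file: on_kill() closes a tempfile.NamedTemporaryFile whose backing file is already gone, so close() fails when it tries to unlink the path. The airflow process therefore exits non-zero and the pod is marked Error even though the task instance itself was recorded as successful. A minimal sketch of that failure mode outside Airflow (plain Python, standard library only):

import os
import tempfile

# NamedTemporaryFile(delete=True) unlinks its backing file inside close().
error_file = tempfile.NamedTemporaryFile(delete=True)

# Simulate the error file disappearing early, as implied by the traceback
# above (the exact point where it vanishes inside Airflow is not shown here).
os.unlink(error_file.name)

# close() now raises FileNotFoundError: [Errno 2] No such file or directory
error_file.close()

Tolerating a missing error file in on_finish() (catching FileNotFoundError, or an equivalent missing_ok-style guard around the unlink) would avoid the spurious non-zero exit.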
How to reproduce it:
simple_dag.py
import time
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta

default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "start_date": datetime(2020, 1, 1),
    "email": ["support@airflow.com"],
    "email_on_failure": False,
    "email_on_retry": False,
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

def sleep():
    time.sleep(60)
    return True

with DAG("simple_dag", default_args=default_args, schedule_interval="@once", catchup=False) as dag:
    t1 = PythonOperator(task_id="sleep", python_callable=sleep)
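Before deploying, the DAG file can be parse-checked locally; a minimal sketch, assuming Airflow is installed in the local environment and simple_dag.py sits in the configured dags folder:

from airflow.models import DagBag

# Load the dags folder and fail fast on import errors before shipping
# the file to the cluster.
bag = DagBag(include_examples=False)
assert not bag.import_errors, bag.import_errors
assert "simple_dag" in bag.dags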
myconf.yaml
executor: KubernetesExecutor
fernetKey: "XXXXXXXXXX"
defaultAirflowTag: "2.0.2-python3.8"
airflowVersion: "2.0.2"

config:
  logging:
    colored_console_log: "True"
    remote_logging: "True"
    remote_base_log_folder: "cloudwatch://${log_group_arn}"
    remote_log_conn_id: "aws_default"
  core:
    load_examples: "False"
    store_dag_code: "True"
    parallelism: "1000"
    dag_concurrency: "1000"
    max_active_runs_per_dag: "1000"
    non_pooled_task_slot_count: "1000"
  scheduler:
    job_heartbeat_sec: 5
    scheduler_heartbeat_sec: 5
    parsing_processes: 2
  webserver:
    base_url: "http://${web_url}/airflow"
  secrets:
    backend: "airflow.contrib.secrets.aws_systems_manager.SystemsManagerParameterStoreBackend"
    backend_kwargs: XXXXXXXXXX

webserver:
  replicas: 3
  nodeSelector:
    namespace: airflow
  serviceAccount:
    name: ${service_account_name}
    annotations:
      eks.amazonaws.com/role-arn: ${service_account_iamrole_arn}
  service:
    type: NodePort

ingress:
  enabled: true
  web:
    precedingPaths:
      - path: "/*"
        serviceName: "ssl-redirect"
        servicePort: "use-annotation"
    path: "/airflow/*"
    annotations:
      external-dns.alpha.kubernetes.io/hostname: ${web_url}
      kubernetes.io/ingress.class: alb
      alb.ingress.kubernetes.io/scheme: internal
      alb.ingress.kubernetes.io/target-type: ip
      alb.ingress.kubernetes.io/target-group-attributes: stickiness.enabled=true,stickiness.lb_cookie.duration_seconds=3600
      alb.ingress.kubernetes.io/certificate-arn: ${aws_acm_certificate_arn}
      alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS":443}]'
      alb.ingress.kubernetes.io/actions.ssl-redirect: '{"Type": "redirect", "RedirectConfig": { "Protocol": "HTTPS", "Port": "443", "StatusCode": "HTTP_301"}}'

scheduler:
  replicas: 3
  nodeSelector:
    namespace: airflow
  serviceAccount:
    name: ${service_account_name}
    annotations:
      eks.amazonaws.com/role-arn: ${service_account_iamrole_arn}

workers:
  serviceAccount:
    name: ${service_account_name}
    annotations:
      eks.amazonaws.com/role-arn: ${service_account_iamrole_arn}

dags:
  persistence:
    enabled: true
    storageClassName: ${storage_class_dags}

logs:
  persistence:
    enabled: true
    storageClassName: ${storage_class_logs}

postgresql:
  enabled: false

data:
  metadataSecretName: ${metadata_secret_name}
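These values are applied with the official chart in the usual way, e.g. helm upgrade --install airflow apache-airflow/airflow -n airflow -f myconf.yaml (assuming the apache-airflow Helm repository has already been added). Note that the ${...} placeholders are substituted by whatever templating step produces the final file and are not valid values as-is.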
Top GitHub Comments
@ephraimbuddy how is it going with this issue? We are also experiencing it on version 2.1.1.
@ephraimbuddy Any update on this issue? It still seems to persist in Airflow 2.1.4. I am specifically getting this error when passing a pod template.