
KubernetesExecutor: All task pods are terminating with error while tasks succeed

See original GitHub issue

  • Apache Airflow version: 2.0.2+
  • Kubernetes version: 1.20
  • Helm chart version: 1.0.0

What happened: Successful task pods are terminating with an error.

I did further testing with different image versions; my results are below:

  • 2.0.1-python3.8 - OK
  • 2.0.2-python3.6 - NOK (helm chart default image)
  • 2.0.2-python3.8 - NOK
  • 2.1.0-python3.8 - NOK
[Screenshot 2021-05-24 at 12:40:30]
▶ kubectl -n airflow get pods

NAME                                              READY   STATUS      RESTARTS   AGE
airflow-s3-sync-1621853400-x8hzv                  0/1     Completed   0          11s
airflow-scheduler-865c754f55-6fdkt                2/2     Running     0          5m45s
airflow-scheduler-865c754f55-hqbv2                2/2     Running     0          5m45s
airflow-scheduler-865c754f55-hw65l                2/2     Running     0          5m45s
airflow-statsd-84f4f9898-r9xxm                    1/1     Running     0          5m45s
airflow-webserver-7c66d4cd99-28jxv                1/1     Running     0          5m45s
airflow-webserver-7c66d4cd99-d8wrf                1/1     Running     0          5m45s
airflow-webserver-7c66d4cd99-xn2hq                1/1     Running     0          5m45s
simpledagsleep.4862fcd4ec8c4adfb10e421feee88745   0/1     Error       0          2m25s
▶ kubectl -n airflow logs simpledagsleep.4862fcd4ec8c4adfb10e421feee88745

BACKEND=postgresql
DB_HOST=XXXXXXXXXXXXXXXXXXXXXXXX
DB_PORT=5432

[2021-05-24 10:47:57,843] {dagbag.py:451} INFO - Filling up the DagBag from /opt/airflow/dags/simple_dag.py
[2021-05-24 10:47:58,147] {base_aws.py:368} INFO - Airflow Connection: aws_conn_id=aws_default
[2021-05-24 10:47:58,780] {base_aws.py:391} WARNING - Unable to use Airflow Connection for credentials.
[2021-05-24 10:47:58,780] {base_aws.py:392} INFO - Fallback on boto3 credential strategy
[2021-05-24 10:47:58,781] {base_aws.py:395} INFO - Creating session using boto3 credential strategy region_name=eu-central-1
Running <TaskInstance: simple_dag.sleep 2021-05-24T10:47:46.486143+00:00 [queued]> on host simpledagsleep.4862fcd4ec8c4adfb10e421feee88745
Traceback (most recent call last):
  File "/home/airflow/.local/bin/airflow", line 8, in <module>
    sys.exit(main())
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/__main__.py", line 40, in main
    args.func(args)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/cli/cli_parser.py", line 48, in command
    return func(*args, **kwargs)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/utils/cli.py", line 89, in wrapper
    return f(*args, **kwargs)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/cli/commands/task_command.py", line 235, in task_run
    _run_task_by_selected_method(args, dag, ti)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/cli/commands/task_command.py", line 64, in _run_task_by_selected_method
    _run_task_by_local_task_job(args, ti)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/cli/commands/task_command.py", line 120, in _run_task_by_local_task_job
    run_job.run()
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/base_job.py", line 237, in run
    self._execute()
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/local_task_job.py", line 142, in _execute
    self.on_kill()
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/local_task_job.py", line 157, in on_kill
    self.task_runner.on_finish()
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/task/task_runner/base_task_runner.py", line 178, in on_finish
    self._error_file.close()
  File "/usr/local/lib/python3.8/tempfile.py", line 499, in close
    self._closer.close()
  File "/usr/local/lib/python3.8/tempfile.py", line 436, in close
    unlink(self.name)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpt63agqia'
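The traceback ends inside `tempfile`'s `close()`, which calls `unlink()` on the task runner's error file. If that file has already been removed by the time `on_finish()` runs, the cleanup reduces to a double delete, which is exactly what raises `FileNotFoundError`. A minimal sketch of the failure mode and a defensive cleanup (a hypothetical workaround for illustration, not Airflow's actual fix):

```python
import contextlib
import os
import tempfile

# Create a temporary file the same way a task runner might hold an error file.
fd, path = tempfile.mkstemp()
os.close(fd)

os.unlink(path)  # first deletion: the file is cleaned up elsewhere

try:
    os.unlink(path)  # second deletion: the same call tempfile makes on close()
    double_unlink_raised = False
except FileNotFoundError:
    double_unlink_raised = True  # the race seen in the traceback

# Defensive variant: treat "already deleted" as success.
with contextlib.suppress(FileNotFoundError):
    os.unlink(path)

print("double unlink raised FileNotFoundError:", double_unlink_raised)
```

Guarding the cleanup with `contextlib.suppress(FileNotFoundError)` makes the delete idempotent, so the pod would exit cleanly even if the file was removed out from under it.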

How to reproduce it:

simple_dag.py

import time

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta

default_args = {
    "owner"           : "airflow",
    "depends_on_past" : False,
    "start_date"      : datetime(2020, 1, 1),
    "email"           : ["support@airflow.com"],
    "email_on_failure": False,
    "email_on_retry"  : False,
    "retries"         : 1,
    "retry_delay"     : timedelta(minutes=5)
}


def sleep():
    time.sleep(60)
    return True


with DAG("simple_dag", default_args=default_args, schedule_interval="@once", catchup=False) as dag:
    t1 = PythonOperator(task_id="sleep", python_callable=sleep)

myconf.yaml

executor: KubernetesExecutor
fernetKey: "XXXXXXXXXX"

defaultAirflowTag: "2.0.2-python3.8"
airflowVersion: "2.0.2"


config:
  logging:
    colored_console_log: "True"
    remote_logging: "True"
    remote_base_log_folder: "cloudwatch://${log_group_arn}"
    remote_log_conn_id: "aws_default"
  core:
    load_examples: "False"
    store_dag_code: "True"
    parallelism: "1000"
    dag_concurrency: "1000"
    max_active_runs_per_dag: "1000"
    non_pooled_task_slot_count: "1000"
  scheduler:
    job_heartbeat_sec: 5
    scheduler_heartbeat_sec: 5
    parsing_processes: 2
  webserver:
    base_url: "http://${web_url}/airflow"
  secrets:
    backend: "airflow.contrib.secrets.aws_systems_manager.SystemsManagerParameterStoreBackend"
    backend_kwargs: XXXXXXXXXX

webserver:
  replicas: 3
  nodeSelector:
    namespace: airflow
  serviceAccount:
    name: ${service_account_name}
    annotations:
      eks.amazonaws.com/role-arn: ${service_account_iamrole_arn}
  service:
    type: NodePort
ingress:
  enabled: true
  web:
    precedingPaths:
      - path: "/*"
        serviceName: "ssl-redirect"
        servicePort: "use-annotation"
    path: "/airflow/*"
    annotations:
      external-dns.alpha.kubernetes.io/hostname: ${web_url}
      kubernetes.io/ingress.class: alb
      alb.ingress.kubernetes.io/scheme: internal
      alb.ingress.kubernetes.io/target-type: ip
      alb.ingress.kubernetes.io/target-group-attributes: stickiness.enabled=true,stickiness.lb_cookie.duration_seconds=3600
      alb.ingress.kubernetes.io/certificate-arn: ${aws_acm_certificate_arn}
      alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS":443}]'
      alb.ingress.kubernetes.io/actions.ssl-redirect: '{"Type": "redirect", "RedirectConfig": { "Protocol": "HTTPS", "Port": "443", "StatusCode": "HTTP_301"}}'
scheduler:
  replicas: 3
  nodeSelector:
    namespace: airflow
  serviceAccount:
    name: ${service_account_name}
    annotations:
      eks.amazonaws.com/role-arn: ${service_account_iamrole_arn}
workers:
  serviceAccount:
    name: ${service_account_name}
    annotations:
      eks.amazonaws.com/role-arn: ${service_account_iamrole_arn}
dags:
  persistence:
    enabled: true
    storageClassName: ${storage_class_dags}
logs:
  persistence:
    enabled: true
    storageClassName: ${storage_class_logs}
postgresql:
  enabled: false
data:
  metadataSecretName: ${metadata_secret_name}
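As a side note, every setting under `config:` above can equivalently be overridden with an `AIRFLOW__<SECTION>__<KEY>` environment variable, which is useful when checking what a pod actually runs with. A minimal sketch of that mapping (the helper name is ours, not part of Airflow or the chart):

```python
def airflow_env_vars(config: dict) -> dict:
    """Map a {section: {key: value}} dict to Airflow's env-var override names."""
    env = {}
    for section, options in config.items():
        for key, value in options.items():
            # Airflow reads overrides as AIRFLOW__<SECTION>__<KEY>, uppercased.
            env[f"AIRFLOW__{section.upper()}__{key.upper()}"] = str(value)
    return env

# Example with a subset of the values file above:
print(airflow_env_vars({"core": {"load_examples": "False", "parallelism": "1000"}}))
```

Running this prints the variable names you would expect to find (or set) on the scheduler and worker pods, e.g. `AIRFLOW__CORE__LOAD_EXAMPLES`.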

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 12 (10 by maintainers)

Top GitHub Comments

2 reactions
trucnguyenlam commented on Jul 16, 2021

@ephraimbuddy how is it going with this issue? We are also experiencing it on version 2.1.1.

1 reaction
deveshbajaj59 commented on Sep 24, 2021

@ephraimbuddy any update on this issue? It still seems to persist in Airflow 2.1.4. I am specifically getting this error when I pass a pod template.


