SMTP connection in email utils has no default timeout which causes the connection to hang indefinitely if it can't reach the SMTP server.
KPE was a red herring; see the comments below the issue template for what I figured out.
Apache Airflow version: 1.10.10
Kubernetes version (if you are using kubernetes) (use kubectl version): v1.16.8-eks-e16311
Environment:
AIRFLOW__KUBERNETES__WORKER_CONTAINER_REPOSITORY=apache/airflow
DATABASE_DB=airflow
AIRFLOW__KUBERNETES__WORKER_CONTAINER_TAG=1.10.10-python3.7
BOILING_LAND_WEB_PORT_8080_TCP_PROTO=tcp
AIRFLOW__KUBERNETES__IN_CLUSTER=True
DATABASE_PASSWORD=snip
AIRFLOW_GID=50000
SHLVL=1
AWS_WEB_IDENTITY_TOKEN_FILE=/var/run/secrets/eks.amazonaws.com/serviceaccount/token
KUBERNETES_PORT_443_TCP_PROTO=tcp
BOILING_LAND_WEB_SERVICE_HOST=172.20.191.242
LC_MESSAGES=C.UTF-8
PYTHON_PIP_VERSION=20.0.2
KUBERNETES_PORT_443_TCP_ADDR=172.20.0.1
DATABASE_HOST=snip
AIRFLOW__CORE__REMOTE_LOG_CONN_ID=s3_connection
AIRFLOW__EMAIL__EMAIL_BACKEND=airflow.utils.email.send_email_smtp
LC_CTYPE=C.UTF-8
AIRFLOW__SMTP__SMTP_STARTTLS=False
AIRFLOW__KUBERNETES__WORKER_SERVICE_ACCOUNT_NAME=boiling-land
PYTHON_GET_PIP_SHA256=snip
AIRFLOW__CORE__SQL_ALCHEMY_CONN=snip
KUBERNETES_SERVICE_HOST=172.20.0.1
LC_ALL=C.UTF-8
AIRFLOW__CORE__REMOTE_LOGGING=True
KUBERNETES_PORT=tcp://172.20.0.1:443
KUBERNETES_PORT_443_TCP_PORT=443
AIRFLOW_KUBERNETES_ENVIRONMENT_VARIABLES_KUBE_CLIENT_REQUEST_TIMEOUT_SEC=50
AIRFLOW__KUBERNETES__GIT_BRANCH=master
PYTHON_GET_PIP_URL=https://github.com/pypa/get-pip/raw/d59197a3c169cef378a22428a3fa99d33e080a5d/get-pip.py
AIRFLOW__KUBERNETES__DELETE_WORKER_PODS=False
PATH=/home/airflow/.local/bin:/home/airflow/.local/bin:/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
AIRFLOW__KUBERNETES__DAGS_VOLUME_SUBPATH=repo/
PYTHON_BASE_IMAGE=python:3.7-slim-buster
AIRFLOW_UID=50000
AIRFLOW__CORE__FERNET_KEY=snip
DEBIAN_FRONTEND=noninteractive
BOILING_LAND_WEB_PORT_8080_TCP=tcp://172.20.191.242:8080
AIRFLOW__KUBERNETES_ENVIRONMENT_VARIABALES__AIRFLOW__CORE__FERNET_KEY=snip
AIRFLOW__SMTP__SMTP_SSL=False
BOILING_LAND_WEB_PORT_8080_TCP_ADDR=172.20.191.242
AIRFLOW__SMTP__SMTP_HOST=email-smtp.us-east-1.amazonaws.com
_=/usr/bin/env
- Cloud provider or hardware configuration: AWS EKS
- OS (e.g. from /etc/os-release): NAME="Amazon Linux" VERSION="2"
- Kernel (e.g. uname -a): Linux <AWS_INTERNAL_HOSTNAME>.x86_64 #1 SMP Thu May 7 18:48:23 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
- Install tools:
- Others:
What happened: Using the KubernetesExecutor, a pod is launched that prepares to run a task using the KubernetesPodOperator. The task fails due to an issue in the task definition, such as an invalid option. However, the pod does not exit immediately; it takes about 40 minutes after the failure for it to exit and report its state in the UI. Interestingly, the execution time is correctly listed as < 1 second.
The task logs on the launcher pod also say that the task is being marked as failed, and after about 40 minutes the state does eventually change:
[2020-08-03 04:28:32,844] {taskinstance.py:1202} INFO - Marking task as FAILED.dag_id=arxiv_crawler_pipeline, task_id=launch_crawl_pod, execution_date=20200803T042640, start_date=20200803T042652, end_date=20200803T042832
The scheduler logs on the launcher pod say nothing about the failure though:
[2020-08-03 04:26:51,543] {__init__.py:51} INFO - Using executor LocalExecutor
[2020-08-03 04:26:51,544] {dagbag.py:396} INFO - Filling up the DagBag from /opt/airflow/dags/crawlers/arxiv/arxiv_crawl_pipeline.py
/home/airflow/.local/lib/python3.7/site-packages/airflow/contrib/operators/kubernetes_pod_operator.py:159: PendingDeprecationWarning: Invalid arguments were passed to KubernetesPodOperator (task_id: launch_crawl_pod). Support for passing such arguments will be dropped in Airflow 2.0. Invalid arguments were:
*args: ()
**kwargs: {'reattach_on_restart': True, 'log_events_on_failure': True}
super(KubernetesPodOperator, self).__init__(*args, resources=None, **kwargs)
/home/airflow/.local/lib/python3.7/site-packages/airflow/sensors/base_sensor_operator.py:71: PendingDeprecationWarning: Invalid arguments were passed to HttpSensor (task_id: wait_for_finish). Support for passing such arguments will be dropped in Airflow 2.0. Invalid arguments were:
*args: ()
**kwargs: {'result_check': <function check_http_response at 0x7fdc378bf320>}
super(BaseSensorOperator, self).__init__(*args, **kwargs)
Running %s on host %s <TaskInstance: arxiv_crawler_pipeline.launch_crawl_pod 2020-08-03T04:26:40.850022+00:00 [queued]> arxivcrawlerpipelinelaunchcrawlpod-4c7e99ae14704b2b8fa0d64db508
What you expected to happen: The pod should exit immediately and report the failed task state in the metadata database, which should then be reflected in the job UI in a much more timely fashion.
No idea what went wrong; I've been looking at this for about 12 hours now, and this report is my "I can't figure it out" moment.
How to reproduce it:
Set up an Airflow cluster on a Kubernetes cluster with the KubernetesExecutor and create a job that attempts to launch a KubernetesPodOperator task that will fail, either in the attempt to launch or in the pod that is created by the task itself. A minimal sketch of such a DAG is given below.
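To make the reproduction steps concrete, here is a minimal sketch of the kind of DAG that exhibits the behaviour. All names (dag_id, task_id, image, email address) are illustrative and not taken from the original report. Given the root cause identified in the comments, the key ingredients are email_on_failure=True plus an SMTP host that cannot be reached, so the failure-notification email blocks the worker pod after the task fails.

```python
# Hypothetical minimal DAG (Airflow 1.10.x imports) to reproduce the delay.
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

default_args = {
    "owner": "airflow",
    # Emailing on failure is what triggers the hanging SMTP connection.
    "email": ["alerts@example.com"],
    "email_on_failure": True,
}

with DAG(
    dag_id="kpo_failure_repro",
    default_args=default_args,
    start_date=datetime(2020, 8, 1),
    schedule_interval=None,
) as dag:
    # A task that fails quickly, e.g. because the image cannot be pulled.
    launch_crawl_pod = KubernetesPodOperator(
        task_id="launch_crawl_pod",
        name="failing-pod",
        namespace="default",
        image="registry.invalid/does-not-exist:latest",
        cmds=["true"],
    )
```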
How often does this problem occur? Once? Every time, etc.? Every single time. Any relevant logs to include? Put them here inside a detail tag:
The logs don't really give any insight into why there is such a dramatic lag between the failure and the metadata being updated.
There seems to be a similar issue that was never properly addressed in the bowels of JIRA:
I can confirm that this is the cause. There is currently no option to even pass a timeout to SMTP, but disabling sending emails cleared it right up: the failure was correctly reported within a 5-minute window instead of 40+ minutes. I'm going to look into adding both a sensible default timeout and a config option.
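For context, Python's smtplib.SMTP constructor accepts a timeout argument; without it, the underlying socket connect can block for a very long time when the host is unreachable (how long depends on OS-level TCP settings). A rough sketch of the difference, using a hypothetical unreachable host rather than the reporter's actual SMTP endpoint:

```python
import smtplib
import socket

SMTP_HOST = "email-smtp.invalid"  # hypothetical unreachable host
SMTP_PORT = 587

# Without a timeout, the connect() inside smtplib can block until the
# OS-level TCP timeout kicks in, which may be many minutes:
# smtplib.SMTP(SMTP_HOST, SMTP_PORT)

# With an explicit timeout, the hang is bounded and surfaces as an
# exception that can be logged instead of stalling the worker pod.
try:
    conn = smtplib.SMTP(SMTP_HOST, SMTP_PORT, timeout=30)
    conn.quit()
except (socket.timeout, OSError) as exc:
    print(f"SMTP connection failed quickly: {exc}")
```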
A default timeout was added in https://github.com/apache/airflow/pull/12801.
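The linked PR is the authoritative change; purely as a sketch of the shape of such a fix (the config key name smtp_timeout and the default value here are assumptions, not confirmed from the PR), the email backend would read a timeout from the [smtp] section and pass it through to smtplib:

```python
import smtplib

from airflow.configuration import conf


def _get_smtp_connection():
    """Open an SMTP connection with a bounded timeout (illustrative sketch)."""
    host = conf.get("smtp", "smtp_host")
    port = conf.getint("smtp", "smtp_port")
    # Assumed config key; see the linked PR / current docs for the real option.
    timeout = conf.getint("smtp", "smtp_timeout")
    if conf.getboolean("smtp", "smtp_ssl"):
        return smtplib.SMTP_SSL(host, port, timeout=timeout)
    return smtplib.SMTP(host, port, timeout=timeout)
```

With a bounded timeout in place, an unreachable SMTP host makes the failure-notification attempt error out within seconds instead of holding the worker pod open for tens of minutes, which is exactly the symptom described in this issue.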