SMTP connection in email utils has no default timeout which causes the connection to hang indefinitely if it can't reach the SMTP server.
KPE was a red herring; see the comments below the issue template for what I figured out.
Apache Airflow version: 1.10.10
Kubernetes version (if you are using kubernetes) (use kubectl version): v1.16.8-eks-e16311
Environment:
AIRFLOW__KUBERNETES__WORKER_CONTAINER_REPOSITORY=apache/airflow
DATABASE_DB=airflow
AIRFLOW__KUBERNETES__WORKER_CONTAINER_TAG=1.10.10-python3.7
BOILING_LAND_WEB_PORT_8080_TCP_PROTO=tcp
AIRFLOW__KUBERNETES__IN_CLUSTER=True
DATABASE_PASSWORD=snip
AIRFLOW_GID=50000
SHLVL=1
AWS_WEB_IDENTITY_TOKEN_FILE=/var/run/secrets/eks.amazonaws.com/serviceaccount/token
KUBERNETES_PORT_443_TCP_PROTO=tcp
BOILING_LAND_WEB_SERVICE_HOST=172.20.191.242
LC_MESSAGES=C.UTF-8
PYTHON_PIP_VERSION=20.0.2
KUBERNETES_PORT_443_TCP_ADDR=172.20.0.1
DATABASE_HOST=snip
AIRFLOW__CORE__REMOTE_LOG_CONN_ID=s3_connection
AIRFLOW__EMAIL__EMAIL_BACKEND=airflow.utils.email.send_email_smtp
LC_CTYPE=C.UTF-8
AIRFLOW__SMTP__SMTP_STARTTLS=False
AIRFLOW__KUBERNETES__WORKER_SERVICE_ACCOUNT_NAME=boiling-land
PYTHON_GET_PIP_SHA256=snip
AIRFLOW__CORE__SQL_ALCHEMY_CONN=snip
KUBERNETES_SERVICE_HOST=172.20.0.1
LC_ALL=C.UTF-8
AIRFLOW__CORE__REMOTE_LOGGING=True
KUBERNETES_PORT=tcp://172.20.0.1:443
KUBERNETES_PORT_443_TCP_PORT=443
AIRFLOW_KUBERNETES_ENVIRONMENT_VARIABLES_KUBE_CLIENT_REQUEST_TIMEOUT_SEC=50
AIRFLOW__KUBERNETES__GIT_BRANCH=master
PYTHON_GET_PIP_URL=https://github.com/pypa/get-pip/raw/d59197a3c169cef378a22428a3fa99d33e080a5d/get-pip.py
AIRFLOW__KUBERNETES__DELETE_WORKER_PODS=False
PATH=/home/airflow/.local/bin:/home/airflow/.local/bin:/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
AIRFLOW__KUBERNETES__DAGS_VOLUME_SUBPATH=repo/
PYTHON_BASE_IMAGE=python:3.7-slim-buster
AIRFLOW_UID=50000
AIRFLOW__CORE__FERNET_KEY=snip
DEBIAN_FRONTEND=noninteractive
BOILING_LAND_WEB_PORT_8080_TCP=tcp://172.20.191.242:8080
AIRFLOW__KUBERNETES_ENVIRONMENT_VARIABALES__AIRFLOW__CORE__FERNET_KEY=snip
AIRFLOW__SMTP__SMTP_SSL=False
BOILING_LAND_WEB_PORT_8080_TCP_ADDR=172.20.191.242
AIRFLOW__SMTP__SMTP_HOST=email-smtp.us-east-1.amazonaws.com
_=/usr/bin/env
- Cloud provider or hardware configuration: AWS EKS
- OS (e.g. from /etc/os-release): NAME="Amazon Linux" VERSION="2"
- Kernel (e.g. uname -a): Linux <AWS_INTERNAL_HOSTNAME>.x86_64 #1 SMP Thu May 7 18:48:23 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
- Install tools:
- Others:
What happened: Using the KubernetesExecutor, a pod is launched that prepares to run a task using the KubernetesPodOperator. The task fails due to an issue in the task definition, such as an invalid option. However, the pod does not exit immediately; it takes about 40 minutes after the failure for it to exit and report its state in the UI. Interestingly, the execution time is correctly listed as < 1 second.
The task logs on the launcher pod also say that the task is being marked as failed, and after about 40 minutes the state does eventually change:
[2020-08-03 04:28:32,844] {taskinstance.py:1202} INFO - Marking task as FAILED.dag_id=arxiv_crawler_pipeline, task_id=launch_crawl_pod, execution_date=20200803T042640, start_date=20200803T042652, end_date=20200803T042832
The scheduler logs on the launcher pod say nothing about the failure though:
[2020-08-03 04:26:51,543] {__init__.py:51} INFO - Using executor LocalExecutor
[2020-08-03 04:26:51,544] {dagbag.py:396} INFO - Filling up the DagBag from /opt/airflow/dags/crawlers/arxiv/arxiv_crawl_pipeline.py
/home/airflow/.local/lib/python3.7/site-packages/airflow/contrib/operators/kubernetes_pod_operator.py:159: PendingDeprecationWarning: Invalid arguments were passed to KubernetesPodOperator (task_id: launch_crawl_pod). Support for passing such arguments will be dropped in Airflow 2.0. Invalid arguments were:
*args: ()
**kwargs: {'reattach_on_restart': True, 'log_events_on_failure': True}
super(KubernetesPodOperator, self).__init__(*args, resources=None, **kwargs)
/home/airflow/.local/lib/python3.7/site-packages/airflow/sensors/base_sensor_operator.py:71: PendingDeprecationWarning: Invalid arguments were passed to HttpSensor (task_id: wait_for_finish). Support for passing such arguments will be dropped in Airflow 2.0. Invalid arguments were:
*args: ()
**kwargs: {'result_check': <function check_http_response at 0x7fdc378bf320>}
super(BaseSensorOperator, self).__init__(*args, **kwargs)
Running %s on host %s <TaskInstance: arxiv_crawler_pipeline.launch_crawl_pod 2020-08-03T04:26:40.850022+00:00 [queued]> arxivcrawlerpipelinelaunchcrawlpod-4c7e99ae14704b2b8fa0d64db508
What you expected to happen: The pod should exit immediately and report the failed task state in the metadata database, which should then be reflected in the job UI in a much more timely fashion.
No idea what went wrong; I've been looking at this for about 12 hours now, and this report is my "I can't figure it out" moment.
How to reproduce it:
Set up an Airflow cluster on a Kubernetes cluster with the KubernetesExecutor and create a job that attempts to launch a KubernetesPodOperator task that will fail, either in the attempt to launch or in the pod that is created by the task itself. A minimal sketch of such a DAG is given below.
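To make the reproduction steps concrete, here is a minimal sketch of the kind of DAG that exhibits the behaviour. All names (dag_id, task_id, image, email address) are illustrative and not taken from the original report. Given the root cause identified in the comments, the key ingredients are email_on_failure=True plus an SMTP host that cannot be reached, so the failure-notification email blocks the worker pod after the task fails.

```python
# Hypothetical minimal DAG (Airflow 1.10.x imports) to reproduce the delay.
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

default_args = {
    "owner": "airflow",
    # Emailing on failure is what triggers the hanging SMTP connection.
    "email": ["alerts@example.com"],
    "email_on_failure": True,
}

with DAG(
    dag_id="kpo_failure_repro",
    default_args=default_args,
    start_date=datetime(2020, 8, 1),
    schedule_interval=None,
) as dag:
    # A task that fails quickly, e.g. because the image cannot be pulled.
    launch_crawl_pod = KubernetesPodOperator(
        task_id="launch_crawl_pod",
        name="failing-pod",
        namespace="default",
        image="registry.invalid/does-not-exist:latest",
        cmds=["true"],
    )
```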
How often does this problem occur? Once? Every time, etc.? Every single time. Any relevant logs to include? Put them here inside a detail tag:
The logs don't really give any insight into why there is such a dramatic lag between the failure and the metadata being updated.
There seems to be a similar issue that was never properly addressed in the bowels of JIRA:
I can confirm that this is the cause. There is currently no option to even pass a timeout to SMTP, but disabling sending emails cleared it right up: the failure was correctly reported within a 5-minute window instead of 40+ minutes. I'm going to look into adding both a sensible default timeout and a config option.
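For context, Python's smtplib.SMTP constructor accepts a timeout argument; without it, the underlying socket connect can block for a very long time when the host is unreachable (how long depends on OS-level TCP settings). A rough sketch of the difference, using a hypothetical unreachable host rather than the reporter's actual SMTP endpoint:

```python
import smtplib
import socket

SMTP_HOST = "email-smtp.invalid"  # hypothetical unreachable host
SMTP_PORT = 587

# Without a timeout, the connect() inside smtplib can block until the
# OS-level TCP timeout kicks in, which may be many minutes:
# smtplib.SMTP(SMTP_HOST, SMTP_PORT)

# With an explicit timeout, the hang is bounded and surfaces as an
# exception that can be logged instead of stalling the worker pod.
try:
    conn = smtplib.SMTP(SMTP_HOST, SMTP_PORT, timeout=30)
    conn.quit()
except (socket.timeout, OSError) as exc:
    print(f"SMTP connection failed quickly: {exc}")
```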
A default timeout was added in https://github.com/apache/airflow/pull/12801.
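The linked PR is the authoritative change; purely as a sketch of the shape of such a fix (the config key name smtp_timeout and the default value here are assumptions, not confirmed from the PR), the email backend would read a timeout from the [smtp] section and pass it through to smtplib:

```python
import smtplib

from airflow.configuration import conf


def _get_smtp_connection():
    """Open an SMTP connection with a bounded timeout (illustrative sketch)."""
    host = conf.get("smtp", "smtp_host")
    port = conf.getint("smtp", "smtp_port")
    # Assumed config key; see the linked PR / current docs for the real option.
    timeout = conf.getint("smtp", "smtp_timeout")
    if conf.getboolean("smtp", "smtp_ssl"):
        return smtplib.SMTP_SSL(host, port, timeout=timeout)
    return smtplib.SMTP(host, port, timeout=timeout)
```

With a bounded timeout in place, an unreachable SMTP host makes the failure-notification attempt error out within seconds instead of holding the worker pod open for tens of minutes, which is exactly the symptom described in this issue.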