Airflow 2.0.0 Tasks not retried on failure when retries>=1 in case of OperationalError
Apache Airflow version: 2.0.0
Kubernetes version (if you are using kubernetes) (use kubectl version): v1.18.15
Environment:
- Cloud provider or hardware configuration: AWS (running in k8s cluster provisioned by kops)
- OS (e.g. from /etc/os-release): using Docker image apache/airflow:2.0.0-python3.8
PRETTY_NAME="Debian GNU/Linux 10 (buster)" NAME="Debian GNU/Linux" VERSION_ID="10" VERSION="10 (buster)" VERSION_CODENAME=buster ID=debian HOME_URL="https://www.debian.org/" SUPPORT_URL="https://www.debian.org/support" BUG_REPORT_URL="https://bugs.debian.org/"
- Kernel (e.g. uname -a): Linux 5.4.92-flatcar #1 SMP Wed Jan 27 16:53:10 -00 2021 x86_64 GNU/Linux
- Install tools:
- Others:
- installed snowflake-connector-python==2.3.9
What happened:
From time to time there seems to be a network glitch between the K8s pods and the RDS database (Postgres 11.6). The worker pod fails with the error Failed to log action with (psycopg2.OperationalError) SSL SYSCALL error: Connection timed out, but the task, even though it has retries=1, is marked as failed and no retry happens. Because our DAG consists of serial tasks (see the sketch below), the whole DAG is marked as FAILED. This seems to happen only in the case of an OperationalError (network glitch).
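For context, here is a minimal sketch of the shape of the affected DAG: serial tasks with retries=1 set via default_args. The DAG id, task ids, and callables below are hypothetical placeholders, not our real DAG.

```python
# Illustrative sketch only: two serial tasks with retries=1.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("extract step")


def load():
    print("load step")


with DAG(
    dag_id="example_serial_dag",            # hypothetical name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    default_args={
        "retries": 1,                        # we expect one retry on failure
        "retry_delay": timedelta(minutes=5),
    },
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="load", python_callable=load)

    # Serial dependency: if t1 fails without retrying, the whole run fails.
    t1 >> t2
```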
What you expected to happen:
I expected that the task would be retried. We saw the same glitch with Airflow 1.10.14, but there the task was retried and marked as successful on the second attempt.
How to reproduce it: I didn’t manage to reproduce it.
Anything else we need to know:
The value of max_db_retries in airflow.cfg is set to 3 (see the snippet below).
The issue happens from time to time; depending on the day, it can occur 2-3 times per day.
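For reference, this is how we double-checked the effective value at runtime. This is a hedged sketch; it assumes max_db_retries is read from the [core] section in Airflow 2.0.x.

```python
# Print the effective max_db_retries value as Airflow sees it
# (assumes the option lives under [core] in this Airflow version).
from airflow.configuration import conf

print(conf.getint("core", "max_db_retries"))  # prints 3 in our setup
```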
Top GitHub Comments
@d3centr your case seems different and is resolved by https://github.com/apache/airflow/pull/16301, released in 2.1.3.
Also, this ticket has been resolved by https://github.com/apache/airflow/pull/17819, which will be released in 2.2.
To confirm, @nicor88, when you get this operational error, do you also see a message like this in the scheduler log:
<TaskInstance: somedag.taskid [queued]> finished (failed) although the task says its queued. (Info: None) Was the task killed externally?
Could you share the scheduler log and task logs? Maybe I’m mixing things up.
Yes, I will take a look.