
Airflow 2.0.0 Tasks not retried on failure when retries>=1 in case of OperationalError

See original GitHub issue

Apache Airflow version: 2.0.0

Kubernetes version (if you are using kubernetes) (use kubectl version): v1.18.15

Environment:

  • Cloud provider or hardware configuration: AWS (running in k8s cluster provisioned by kops)
  • OS (e.g. from /etc/os-release): using Docker image apache/airflow:2.0.0-python3.8
PRETTY_NAME="Debian GNU/Linux 10 (buster)"
NAME="Debian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
  • Kernel (e.g. uname -a): Linux 5.4.92-flatcar #1 SMP Wed Jan 27 16:53:10 -00 2021 x86_64 GNU/Linux
  • Install tools:
  • Others:
    • installed snowflake-connector-python==2.3.9

What happened: From time to time there seems to be a network glitch between the K8s pods and the RDS database (Postgres 11.6). The worker pod fails with the error Failed to log action with (psycopg2.OperationalError) SSL SYSCALL error: Connection timed out, but the task, even with retries=1, is marked as Failed and no retry happens. Since our DAG consists of serial tasks, the whole DAG is marked as FAILED. This seems to happen only in the case of an OperationalError (network glitch).
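For context, Airflow's normal retry decision compares a task instance's try number against its configured retries and moves the task to up_for_retry or failed accordingly; the bug reported here is that an OperationalError on the metadata-DB path bypasses that check entirely. A minimal illustrative sketch of the decision (hypothetical function name, not Airflow's actual code):

```python
# Hypothetical sketch of the retry decision made when a task attempt fails.
# In the reported bug, an OperationalError on the metadata-DB/logging path
# caused the task to be marked failed without this check ever applying.

def next_state_on_failure(try_number: int, retries: int) -> str:
    """Return the state a failed task instance should move to.

    try_number is the 1-based number of the attempt that just failed;
    retries is the task's configured retry count.
    """
    if try_number <= retries:
        return "up_for_retry"
    return "failed"

# With retries=1: the first failure should schedule a retry,
# the second should mark the task failed for good.
print(next_state_on_failure(1, 1))  # up_for_retry
print(next_state_on_failure(2, 1))  # failed
```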

What you expected to happen:

I expected the task to be retried. We experienced the same issue with Airflow 1.10.14, where the task was retried and marked as successful on the 2nd attempt.

How to reproduce it: I didn’t manage to reproduce it.

Anything else we need to know: The value of max_db_retries in airflow.cfg is set to 3. The issue happens from time to time; depending on the day, up to 2–3 times per day.
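max_db_retries is meant to make Airflow retry metadata-DB operations that fail with a transient OperationalError. The general retry-with-backoff pattern it implements can be sketched as follows (a sketch only, not Airflow's implementation; the OperationalError stand-in and the flaky query are hypothetical):

```python
import time


class OperationalError(Exception):
    """Stand-in for psycopg2/SQLAlchemy OperationalError (e.g. a network glitch)."""


def run_with_db_retries(fn, max_db_retries=3, backoff=0.01):
    """Call fn, retrying on transient OperationalError up to max_db_retries
    attempts, in the spirit of airflow.cfg's max_db_retries setting."""
    for attempt in range(1, max_db_retries + 1):
        try:
            return fn()
        except OperationalError:
            if attempt == max_db_retries:
                raise  # exhausted all attempts; surface the error
            time.sleep(backoff * attempt)  # simple linear backoff


# Simulated network glitch: the query fails twice, then succeeds.
calls = {"n": 0}

def flaky_query():
    calls["n"] += 1
    if calls["n"] < 3:
        raise OperationalError("SSL SYSCALL error: Connection timed out")
    return "ok"


print(run_with_db_retries(flaky_query))  # ok
```

With max_db_retries=3 the third attempt succeeds; with a lower cap the error would propagate, which is the failure mode the issue describes.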

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 12 (7 by maintainers)

Top GitHub Comments

ephraimbuddy commented, Oct 4, 2021 (1 reaction)

@d3centr, your case seems different and is resolved by https://github.com/apache/airflow/pull/16301, released in 2.1.3.

Also, this ticket has been resolved by https://github.com/apache/airflow/pull/17819, which will be released in 2.2.

To confirm, @nicor88: when you get this operational error, do you also see a scheduler log message like this: <TaskInstance: somedag.taskid [queued]> finished (failed) although the task says its queued. (Info: None) Was the task killed externally? Can you share the scheduler log and the task logs? Maybe I’m mixing things up.

ephraimbuddy commented, Sep 14, 2021 (1 reaction)

@ephraimbuddy Can you look at this one too – it is similar to the process executor events case, where we should retry instead of fail

Yes. I will take a look


