Airflow 2.0.0 Tasks not retried on failure when retries>=1 in case of OperationalError
Apache Airflow version: 2.0.0
Kubernetes version (if you are using kubernetes) (use kubectl version): v1.18.15
Environment:
- Cloud provider or hardware configuration: AWS (running in k8s cluster provisioned by kops)
- OS (e.g. from /etc/os-release): using Docker image apache/airflow:2.0.0-python3.8
PRETTY_NAME="Debian GNU/Linux 10 (buster)" NAME="Debian GNU/Linux" VERSION_ID="10" VERSION="10 (buster)" VERSION_CODENAME=buster ID=debian HOME_URL="https://www.debian.org/" SUPPORT_URL="https://www.debian.org/support" BUG_REPORT_URL="https://bugs.debian.org/"
- Kernel (e.g. uname -a): Linux 5.4.92-flatcar #1 SMP Wed Jan 27 16:53:10 -00 2021 x86_64 GNU/Linux
- Install tools:
- Others:
- installed snowflake-connector-python==2.3.9
What happened:
From time to time there seems to be a network glitch between the K8s pods and the RDS database (Postgres 11.6). The worker pod fails with the error Failed to log action with (psycopg2.OperationalError) SSL SYSCALL error: Connection timed out, but the task, even though it has retries=1, is marked as failed and no retry happens. Because our DAG consists of serial tasks (see the sketch below), the whole DAG is marked as FAILED. This seems to happen only in the case of an OperationalError (network glitch).
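For context, here is a minimal sketch of the shape of the affected DAG: serial tasks with retries=1 set via default_args. The DAG id, task ids, and callables below are hypothetical placeholders, not our real DAG.

```python
# Illustrative sketch only: two serial tasks with retries=1.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("extract step")


def load():
    print("load step")


with DAG(
    dag_id="example_serial_dag",            # hypothetical name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    default_args={
        "retries": 1,                        # we expect one retry on failure
        "retry_delay": timedelta(minutes=5),
    },
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="load", python_callable=load)

    # Serial dependency: if t1 fails without retrying, the whole run fails.
    t1 >> t2
```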
What you expected to happen:
I expected that the task would be retried. We saw the same glitch with Airflow 1.10.14, but there the task was retried and marked as successful on the second attempt.
How to reproduce it: I didn’t manage to reproduce it.
Anything else we need to know:
The value of max_db_retries in airflow.cfg is set to 3 (see the snippet below).
The issue happens from time to time; depending on the day, it can occur 2-3 times per day.
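For reference, this is how we double-checked the effective value at runtime. This is a hedged sketch; it assumes max_db_retries is read from the [core] section in Airflow 2.0.x.

```python
# Print the effective max_db_retries value as Airflow sees it
# (assumes the option lives under [core] in this Airflow version).
from airflow.configuration import conf

print(conf.getint("core", "max_db_retries"))  # prints 3 in our setup
```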
Top GitHub Comments
@d3centr your case seems different and is resolved by https://github.com/apache/airflow/pull/16301, released in 2.1.3.
Also, this ticket has been resolved by https://github.com/apache/airflow/pull/17819, which will be released in 2.2.
To confirm, @nicor88, when you get this operational error, do you also see a message like this in the scheduler log:
<TaskInstance: somedag.taskid [queued]> finished (failed) although the task says its queued. (Info: None) Was the task killed externally?
Could you share the scheduler log and task logs? Maybe I’m mixing things up.
Yes, I will take a look.