Kubernetes operator retry regression

Apache Airflow version: 1.10.12

Kubernetes version (if you are using kubernetes) (use kubectl version): 1.15.9

Environment:

  • Cloud provider or hardware configuration: AWS
  • OS (e.g. from /etc/os-release): Debian 9 (Stretch)

What happened: As of Airflow 1.10.12, and going back to sometime around 1.10.10 or 1.10.11, the retry behavior of the Kubernetes pod operator has regressed. Previously, when a pod failed due to an error, Airflow would spin up a new pod in Kubernetes on retry. As of 1.10.12, Airflow instead tries to reuse the same broken pod over and over:

INFO - found a running pod with labels {'dag_id': 'my_dad', 'task_id': 'my_task', 'execution_date': '2020-11-04T1300000000-e807cde8a', 'try_number': '6'} but a different try_number. Will attach to this pod and monitor instead of starting new one

This is bad because most failures we encounter are caused by the underlying “physical” hardware failing. Retrying on the same pod is pointless because it will never succeed.
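The decision described by the log message above can be modeled roughly as follows. This is a minimal sketch, not Airflow's actual code: the `Pod` class, `find_matching_pods`, `decide`, and the `reattach` flag are hypothetical names introduced for illustration, and the assumption that pods are matched on every label except `try_number` is inferred from the log message.

```python
from dataclasses import dataclass
from typing import Dict, List

# Assumption: an existing pod is looked up by the labels that identify the
# task instance, ignoring try_number (inferred from the log message).
IDENTIFYING_KEYS = ("dag_id", "task_id", "execution_date")


@dataclass
class Pod:
    """Hypothetical stand-in for a Kubernetes pod; only the labels matter."""
    name: str
    labels: Dict[str, str]


def find_matching_pods(pods: List[Pod], labels: Dict[str, str]) -> List[Pod]:
    """Return pods whose identifying labels equal those of the current try."""
    return [
        pod for pod in pods
        if all(pod.labels.get(key) == labels.get(key) for key in IDENTIFYING_KEYS)
    ]


def decide(pods: List[Pod], labels: Dict[str, str], reattach: bool = True) -> str:
    """Return "attach:<pod name>" or "create" for the current try."""
    matches = find_matching_pods(pods, labels)
    if len(matches) > 1:
        raise RuntimeError("more than one pod matches the task labels")
    if matches and reattach:
        # Reported behavior: the retry attaches to the pod left over from the
        # previous try, even though its try_number label differs.
        return "attach:" + matches[0].name
    # Expected behavior on retry: start a fresh pod that can be scheduled
    # onto a different node.
    return "create"


# Hypothetical label values for a pod left behind by try 1.
broken_pod = Pod(
    name="my-task-pod",
    labels={
        "dag_id": "my_dag",
        "task_id": "my_task",
        "execution_date": "2020-11-04T130000",
        "try_number": "1",
    },
)
retry_labels = dict(broken_pod.labels, try_number="2")

print(decide([broken_pod], retry_labels))                  # attach:my-task-pod
print(decide([broken_pod], retry_labels, reattach=False))  # create
print(decide([], retry_labels))                            # create
```

In this model, a retry only gets a fresh pod when no leftover pod matches or when reattaching is switched off, which is why a pod stuck on a failed node keeps being picked up.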

What you expected to happen: I would expect the Kubernetes pod operator to start a new pod on retry, so that it can be scheduled on a different k8s node that does not have the underlying “physical” hardware problem, just as it did in earlier versions of Airflow.

How to reproduce it: Run a Kubernetes pod operator task with a retry count set, and make the node fail in a way that the pod can never succeed.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

1 reaction
dimberman commented, Dec 8, 2020

Hi @philipherrmann @pceric, please let me know if this bug is fixed. I would also recommend using the cncf.kubernetes backport provider instead of the operator in Airflow itself, as those are being deprecated in 2.0 (you’ll also get fixes much faster through providers).

0 reactions
pceric commented, Dec 14, 2020

I installed 1.10.14 today, and while the behavior is a bit odd, it does work. If I have retries set to 5, Airflow will run 10 retries, with every odd retry being a “dummy retry” that gathers the output from the previous failure. But since everything works as expected, I’m calling this fixed.

Top Results From Across the Web

[GitHub] [airflow] pceric commented on issue #12111: Kubernetes operator retry regression · GitBox, Fri, 27 Nov 2020 16:18:03 -0800.