
Errors when launching many pods simultaneously on GKE


Apache Airflow version: 2.0.1

Kubernetes version (if you are using kubernetes) (use kubectl version): 1.18.15-gke.1500

Environment:

  • Cloud provider or hardware configuration: Google Cloud
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

What happened:

When many pods are launched at the same time (typically through the KubernetesPodOperator), some will fail due to a 409 Conflict error encountered when the API server modifies the gke-resource-quotas ResourceQuota object.

Full stack trace:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1112, in _run_raw_task
    self._prepare_and_execute_task_with_callbacks(context, task)
  File "/usr/local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1285, in _prepare_and_execute_task_with_callbacks
    result = self._execute_task(context, task_copy)
  File "/usr/local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1310, in _execute_task
    result = task_copy.execute(context=context)
  File "/usr/local/lib/python3.8/site-packages/airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py", line 339, in execute
    final_state, _, result = self.create_new_pod_for_operator(labels, launcher)
  File "/usr/local/lib/python3.8/site-packages/airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py", line 485, in create_new_pod_for_operator
    launcher.start_pod(self.pod, startup_timeout=self.startup_timeout_seconds)
  File "/usr/local/lib/python3.8/site-packages/airflow/kubernetes/pod_launcher.py", line 109, in start_pod
    resp = self.run_pod_async(pod)
  File "/usr/local/lib/python3.8/site-packages/airflow/kubernetes/pod_launcher.py", line 87, in run_pod_async
    raise e
  File "/usr/local/lib/python3.8/site-packages/airflow/kubernetes/pod_launcher.py", line 81, in run_pod_async
    resp = self._client.create_namespaced_pod(
  File "/usr/local/lib/python3.8/site-packages/kubernetes/client/apis/core_v1_api.py", line 6115, in create_namespaced_pod
    (data) = self.create_namespaced_pod_with_http_info(namespace, body, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/kubernetes/client/apis/core_v1_api.py", line 6193, in create_namespaced_pod_with_http_info
    return self.api_client.call_api('/api/v1/namespaces/{namespace}/pods', 'POST',
  File "/usr/local/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 330, in call_api
    return self.__call_api(resource_path, method,
  File "/usr/local/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 163, in __call_api
    response_data = self.request(method, url,
  File "/usr/local/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 371, in request
    return self.rest_client.POST(url,
  File "/usr/local/lib/python3.8/site-packages/kubernetes/client/rest.py", line 260, in POST
    return self.request("POST", url,
  File "/usr/local/lib/python3.8/site-packages/kubernetes/client/rest.py", line 222, in request
    raise ApiException(http_resp=r)
kubernetes.client.rest.ApiException: (409)
Reason: Conflict
HTTP response headers: HTTPHeaderDict({'Audit-Id': '9e2e6081-4e52-41fc-8caa-6db9d546990c', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Tue, 30 Mar 2021 15:41:33 GMT', 'Content-Length': '342'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Operation cannot be fulfilled on resourcequotas \"gke-resource-quotas\": the object has been modified; please apply your changes to the latest version and try again","reason":"Conflict","details":{"name":"gke-resource-quotas","kind":"resourcequotas"},"code":409}

This is a known issue in Kubernetes, as outlined in this upstream issue (in which other users specifically mention Airflow): https://github.com/kubernetes/kubernetes/issues/67761

While this can be handled by task retries, I would like to discuss whether it's worth handling this error within the KubernetesPodOperator itself. We could probably check for the error in the pod launcher and automatically retry a few times in this case, as sketched below.
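
For illustration, a minimal sketch of what such a retry could look like. The helper name and parameters are hypothetical, not the existing PodLauncher API; it assumes a kubernetes.client.CoreV1Api client and a V1Pod, which is what run_pod_async works with today:

import time

from kubernetes.client.rest import ApiException


def run_pod_async_with_retry(client, pod, max_attempts=3, backoff_seconds=1.0):
    # Hypothetical helper: retry pod creation when the API server reports
    # a transient 409 Conflict on the resource quota object.
    for attempt in range(1, max_attempts + 1):
        try:
            return client.create_namespaced_pod(
                namespace=pod.metadata.namespace, body=pod
            )
        except ApiException as e:
            # Only the quota-conflict race (HTTP 409) is safe to retry
            # blindly; re-raise anything else, and the final attempt.
            if e.status != 409 or attempt == max_attempts:
                raise
            time.sleep(backoff_seconds * attempt)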

Let me know if you think this is something worth fixing on our end. If so, please assign this issue to me and I can put up a PR in the next week or so.

If you think that this issue is best handled via task retries or fixed upstream in kubernetes, feel free to close this.

What you expected to happen:

I would expect Airflow to be able to launch many pods at the same time without spurious failures.

How to reproduce it:

Create a DAG which runs 30+ KubernetesPodOperator tasks at the same time; a few will likely fail. A minimal sketch follows.
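
This reproduction sketch is illustrative only: the DAG id, image, namespace, and task names are placeholders, and the retries argument shows the task-retry workaround mentioned above:

from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

with DAG(
    dag_id="kpo_409_repro",  # placeholder name
    start_date=datetime(2021, 3, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    for i in range(30):
        KubernetesPodOperator(
            task_id=f"pod_{i}",
            name=f"pod-{i}",
            namespace="default",  # placeholder namespace
            image="busybox",
            cmds=["sh", "-c", "sleep 10"],
            # Task-level retries mask the 409 today, at the cost of
            # consuming retry attempts for a launch-time failure.
            retries=3,
        )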

Anything else we need to know:

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 8 (6 by maintainers)

Top GitHub Comments

1 reaction
potiuk commented, Dec 13, 2021

Airflow 1.10 reached end of life in June 2021, and it has stopped getting even critical security fixes: https://github.com/apache/airflow#version-life-cycle

Please upgrade ASAP to Airflow 2.

In case you have not seen it, the latest Log4j security issue does not affect Airflow, but there might be future similar discoveries that do. So if you want to be sure that you will get a fix fast in case of a similar problem, just make sure you are on Airflow 2.

1 reaction
ashb commented, Mar 31, 2021

I agree with you that cases like this where Airflow was never even able to start the task don’t feel like they should “consume” a retry attempt.


