Errors when launching many pods simultaneously on GKE
Apache Airflow version: 2.0.1
Kubernetes version (if you are using kubernetes) (use kubectl version): 1.18.15-gke.1500
Environment:
- Cloud provider or hardware configuration: Google Cloud
- OS (e.g. from /etc/os-release):
- Kernel (e.g. uname -a):
- Install tools:
- Others:
What happened:
When many pods are launched at the same time (typically through the KubernetesPodOperator), some will fail due to a 409 Conflict error encountered while the API server updates a resourceQuota object.
Full stack trace:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1112, in _run_raw_task
self._prepare_and_execute_task_with_callbacks(context, task)
File "/usr/local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1285, in _prepare_and_execute_task_with_callbacks
result = self._execute_task(context, task_copy)
File "/usr/local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1310, in _execute_task
result = task_copy.execute(context=context)
File "/usr/local/lib/python3.8/site-packages/airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py", line 339, in execute
final_state, _, result = self.create_new_pod_for_operator(labels, launcher)
File "/usr/local/lib/python3.8/site-packages/airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py", line 485, in create_new_pod_for_operator
launcher.start_pod(self.pod, startup_timeout=self.startup_timeout_seconds)
File "/usr/local/lib/python3.8/site-packages/airflow/kubernetes/pod_launcher.py", line 109, in start_pod
resp = self.run_pod_async(pod)
File "/usr/local/lib/python3.8/site-packages/airflow/kubernetes/pod_launcher.py", line 87, in run_pod_async
raise e
File "/usr/local/lib/python3.8/site-packages/airflow/kubernetes/pod_launcher.py", line 81, in run_pod_async
resp = self._client.create_namespaced_pod(
File "/usr/local/lib/python3.8/site-packages/kubernetes/client/apis/core_v1_api.py", line 6115, in create_namespaced_pod
(data) = self.create_namespaced_pod_with_http_info(namespace, body, **kwargs)
File "/usr/local/lib/python3.8/site-packages/kubernetes/client/apis/core_v1_api.py", line 6193, in create_namespaced_pod_with_http_info
return self.api_client.call_api('/api/v1/namespaces/{namespace}/pods', 'POST',
File "/usr/local/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 330, in call_api
return self.__call_api(resource_path, method,
File "/usr/local/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 163, in __call_api
response_data = self.request(method, url,
File "/usr/local/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 371, in request
return self.rest_client.POST(url,
File "/usr/local/lib/python3.8/site-packages/kubernetes/client/rest.py", line 260, in POST
return self.request("POST", url,
File "/usr/local/lib/python3.8/site-packages/kubernetes/client/rest.py", line 222, in request
raise ApiException(http_resp=r)
kubernetes.client.rest.ApiException: (409)
Reason: Conflict
HTTP response headers: HTTPHeaderDict({'Audit-Id': '9e2e6081-4e52-41fc-8caa-6db9d546990c', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Tue, 30 Mar 2021 15:41:33 GMT', 'Content-Length': '342'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Operation cannot be fulfilled on resourcequotas \"gke-resource-quotas\": the object has been modified; please apply your changes to the latest version and try again","reason":"Conflict","details":{"name":"gke-resource-quotas","kind":"resourcequotas"},"code":409}
This is a known issue in Kubernetes, as outlined in this issue (in which other users specifically mention Airflow): https://github.com/kubernetes/kubernetes/issues/67761
While this can be handled by task retries, I would like to discuss whether it's worth handling this error within the KubernetesPodOperator itself. We could check for the 409 in the pod launcher and automatically retry the pod creation a few times in this case.
Let me know if you think this is something worth fixing on our end. If so, please assign this issue to me and I can put up a PR in the next week or so.
If you think this issue is best handled via task retries or fixed upstream in Kubernetes, feel free to close this.
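The proposed fix could look something like the sketch below. It is an illustration only: `ApiException` here is a minimal stand-in for `kubernetes.client.rest.ApiException`, and `create_pod_with_retry` is a hypothetical helper (in the real pod launcher, `create_fn` would be `self._client.create_namespaced_pod`); names and parameters are assumptions, not the actual Airflow API.

```python
import time


class ApiException(Exception):
    """Stand-in for kubernetes.client.rest.ApiException (illustration only)."""

    def __init__(self, status):
        self.status = status
        super().__init__(f"({status})")


def create_pod_with_retry(create_fn, pod, max_retries=3, backoff=1.0):
    """Call create_fn(pod), retrying only on 409 Conflict.

    Any other ApiException, or a 409 that persists past max_retries,
    is re-raised so the task still fails loudly.
    """
    for attempt in range(max_retries + 1):
        try:
            return create_fn(pod)
        except ApiException as e:
            if e.status != 409 or attempt == max_retries:
                raise
            # Back off exponentially before retrying the quota conflict.
            time.sleep(backoff * (2 ** attempt))
```

Retrying only on 409 keeps the change narrowly scoped to the quota-conflict race, so genuine failures (403, 422, etc.) still surface immediately.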
What you expected to happen:
I would expect that Airflow could launch many pods at the same time.
How to reproduce it:
Create a DAG which runs 30+ KubernetesPodOperator tasks at the same time. A few will likely fail with the 409 above.
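A minimal reproduction DAG might look like this. The DAG id, image, namespace, and commands are arbitrary placeholders; the import path matches the provider package shown in the stack trace.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

with DAG(
    dag_id="kpo_409_repro",          # hypothetical DAG id
    start_date=datetime(2021, 3, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Fan out 30 pods with no dependencies so they all launch at once.
    for i in range(30):
        KubernetesPodOperator(
            task_id=f"pod_{i}",
            name=f"pod-{i}",
            namespace="default",
            image="busybox",         # any small image works
            cmds=["sh", "-c", "sleep 5"],
        )
```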
Anything else we need to know:
Top GitHub Comments
Airflow 1.10 reached end of life in June 2021 and no longer receives even critical security fixes. https://github.com/apache/airflow#version-life-cycle
Please upgrade ASAP to Airflow 2.
In case you have not seen it, the recent Log4j security issue does not affect Airflow, but there may be similar discoveries in the future that do. So if you want to be sure you will get a fix quickly in case of a similar problem, make sure you are on Airflow 2.
I agree with you that cases like this where Airflow was never even able to start the task don’t feel like they should “consume” a retry attempt.