Errors when launching many pods simultaneously on GKE
Apache Airflow version: 2.0.1
Kubernetes version (if you are using kubernetes) (use kubectl version): 1.18.15-gke.1500
Environment:
- Cloud provider or hardware configuration: Google Cloud
- OS (e.g. from /etc/os-release):
- Kernel (e.g. uname -a):
- Install tools:
- Others:
What happened:
When many pods are launched at the same time (typically through the KubernetesPodOperator), some will fail due to a 409 Conflict error encountered while the API server updates a resourceQuota object.
Full stack trace:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1112, in _run_raw_task
self._prepare_and_execute_task_with_callbacks(context, task)
File "/usr/local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1285, in _prepare_and_execute_task_with_callbacks
result = self._execute_task(context, task_copy)
File "/usr/local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1310, in _execute_task
result = task_copy.execute(context=context)
File "/usr/local/lib/python3.8/site-packages/airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py", line 339, in execute
final_state, _, result = self.create_new_pod_for_operator(labels, launcher)
File "/usr/local/lib/python3.8/site-packages/airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py", line 485, in create_new_pod_for_operator
launcher.start_pod(self.pod, startup_timeout=self.startup_timeout_seconds)
File "/usr/local/lib/python3.8/site-packages/airflow/kubernetes/pod_launcher.py", line 109, in start_pod
resp = self.run_pod_async(pod)
File "/usr/local/lib/python3.8/site-packages/airflow/kubernetes/pod_launcher.py", line 87, in run_pod_async
raise e
File "/usr/local/lib/python3.8/site-packages/airflow/kubernetes/pod_launcher.py", line 81, in run_pod_async
resp = self._client.create_namespaced_pod(
File "/usr/local/lib/python3.8/site-packages/kubernetes/client/apis/core_v1_api.py", line 6115, in create_namespaced_pod
(data) = self.create_namespaced_pod_with_http_info(namespace, body, **kwargs)
File "/usr/local/lib/python3.8/site-packages/kubernetes/client/apis/core_v1_api.py", line 6193, in create_namespaced_pod_with_http_info
return self.api_client.call_api('/api/v1/namespaces/{namespace}/pods', 'POST',
File "/usr/local/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 330, in call_api
return self.__call_api(resource_path, method,
File "/usr/local/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 163, in __call_api
response_data = self.request(method, url,
File "/usr/local/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 371, in request
return self.rest_client.POST(url,
File "/usr/local/lib/python3.8/site-packages/kubernetes/client/rest.py", line 260, in POST
return self.request("POST", url,
File "/usr/local/lib/python3.8/site-packages/kubernetes/client/rest.py", line 222, in request
raise ApiException(http_resp=r)
kubernetes.client.rest.ApiException: (409)
Reason: Conflict
HTTP response headers: HTTPHeaderDict({'Audit-Id': '9e2e6081-4e52-41fc-8caa-6db9d546990c', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Tue, 30 Mar 2021 15:41:33 GMT', 'Content-Length': '342'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Operation cannot be fulfilled on resourcequotas \"gke-resource-quotas\": the object has been modified; please apply your changes to the latest version and try again","reason":"Conflict","details":{"name":"gke-resource-quotas","kind":"resourcequotas"},"code":409}
This is a known issue in Kubernetes, as outlined in this issue (in which other users specifically mention Airflow): https://github.com/kubernetes/kubernetes/issues/67761
While this can be handled by task retries, I would like to discuss whether it's worth handling this error within the KubernetesPodOperator itself. We could check for the 409 in the pod launcher and automatically retry the pod creation a few times in this case.
Let me know if you think this is something worth fixing on our end. If so, please assign this issue to me and I can put up a PR in the next week or so.
If you think this issue is best handled via task retries or fixed upstream in Kubernetes, feel free to close this.
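The proposed fix could look something like the sketch below. It is an illustration only: `ApiException` here is a minimal stand-in for `kubernetes.client.rest.ApiException`, and `create_pod_with_retry` is a hypothetical helper (in the real pod launcher, `create_fn` would be `self._client.create_namespaced_pod`); names and parameters are assumptions, not the actual Airflow API.

```python
import time


class ApiException(Exception):
    """Stand-in for kubernetes.client.rest.ApiException (illustration only)."""

    def __init__(self, status):
        self.status = status
        super().__init__(f"({status})")


def create_pod_with_retry(create_fn, pod, max_retries=3, backoff=1.0):
    """Call create_fn(pod), retrying only on 409 Conflict.

    Any other ApiException, or a 409 that persists past max_retries,
    is re-raised so the task still fails loudly.
    """
    for attempt in range(max_retries + 1):
        try:
            return create_fn(pod)
        except ApiException as e:
            if e.status != 409 or attempt == max_retries:
                raise
            # Back off exponentially before retrying the quota conflict.
            time.sleep(backoff * (2 ** attempt))
```

Retrying only on 409 keeps the change narrowly scoped to the quota-conflict race, so genuine failures (403, 422, etc.) still surface immediately.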
What you expected to happen:
I would expect that Airflow could launch many pods at the same time.
How to reproduce it:
Create a DAG which runs 30+ KubernetesPodOperator tasks at the same time. A few will likely fail with the 409 above.
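A minimal reproduction DAG might look like this. The DAG id, image, namespace, and commands are arbitrary placeholders; the import path matches the provider package shown in the stack trace.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

with DAG(
    dag_id="kpo_409_repro",          # hypothetical DAG id
    start_date=datetime(2021, 3, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Fan out 30 pods with no dependencies so they all launch at once.
    for i in range(30):
        KubernetesPodOperator(
            task_id=f"pod_{i}",
            name=f"pod-{i}",
            namespace="default",
            image="busybox",         # any small image works
            cmds=["sh", "-c", "sleep 5"],
        )
```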
Anything else we need to know:
Top GitHub Comments
Airflow 1.10 reached end of life in June 2021 and no longer receives even critical security fixes. https://github.com/apache/airflow#version-life-cycle
Please upgrade ASAP to Airflow 2.
In case you have not seen it, the recent Log4j security issue does not affect Airflow, but there may be similar discoveries in the future that do. So if you want to be sure you will get a fix quickly in case of a similar problem, make sure you are on Airflow 2.
I agree with you that cases like this where Airflow was never even able to start the task don’t feel like they should “consume” a retry attempt.