Intermittent failures when deleting pods
Apache Airflow version
2.3.1
What happened
Intermittent error when deleting a pod after the pod reaches state SUCCEEDED; the task is then marked FAILED even though the pod completed successfully:
[2022-06-08, 07:49:40 KST] {kubernetes_pod.py:434} INFO - Deleting pod: hive-kcai-dim-gift-product-cat-4d6acdf27cab46edbb7652fc8d224c90
[2022-06-08, 07:49:40 KST] {taskinstance.py:1890} ERROR - Task failed with exception
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py", line 390, in execute
follow=True,
File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/cncf/kubernetes/utils/pod_manager.py", line 245, in fetch_container_logs
last_log_time = consume_logs(since_time=last_log_time, follow=follow)
File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/cncf/kubernetes/utils/pod_manager.py", line 221, in consume_logs
follow=follow,
File "/home/airflow/.local/lib/python3.7/site-packages/tenacity/__init__.py", line 324, in wrapped_f
return self(f, *args, **kw)
File "/home/airflow/.local/lib/python3.7/site-packages/tenacity/__init__.py", line 404, in __call__
do = self.iter(retry_state=retry_state)
File "/home/airflow/.local/lib/python3.7/site-packages/tenacity/__init__.py", line 360, in iter
raise retry_exc.reraise()
File "/home/airflow/.local/lib/python3.7/site-packages/tenacity/__init__.py", line 193, in reraise
raise self.last_attempt.result()
File "/usr/local/lib/python3.7/concurrent/futures/_base.py", line 428, in result
return self.__get_result()
File "/usr/local/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
File "/home/airflow/.local/lib/python3.7/site-packages/tenacity/__init__.py", line 407, in __call__
result = fn(*args, **kwargs)
File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/cncf/kubernetes/utils/pod_manager.py", line 332, in read_pod_logs
**additional_kwargs,
File "/home/airflow/.local/lib/python3.7/site-packages/kubernetes/client/api/core_v1_api.py", line 23747, in read_namespaced_pod_log
return self.read_namespaced_pod_log_with_http_info(name, namespace, **kwargs) # noqa: E501
File "/home/airflow/.local/lib/python3.7/site-packages/kubernetes/client/api/core_v1_api.py", line 23880, in read_namespaced_pod_log_with_http_info
collection_formats=collection_formats)
File "/home/airflow/.local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 353, in call_api
_preload_content, _request_timeout, _host)
File "/home/airflow/.local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 184, in __call_api
_request_timeout=_request_timeout)
File "/home/airflow/.local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 377, in request
headers=headers)
File "/home/airflow/.local/lib/python3.7/site-packages/kubernetes/client/rest.py", line 244, in GET
query_params=query_params)
File "/home/airflow/.local/lib/python3.7/site-packages/kubernetes/client/rest.py", line 234, in request
raise ApiException(http_resp=r)
kubernetes.client.exceptions.ApiException: (422)
Reason: Unprocessable Entity
HTTP response headers: HTTPHeaderDict({'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Tue, 07 Jun 2022 22:49:40 GMT', 'Content-Length': '490'})
HTTP response body: b'{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"PodLogOptions \\"hive-kcai-dim-gift-product-cat-4d6acdf27cab46edbb7652fc8d224c90\\" is invalid: sinceSeconds: Invalid value: -64: must be greater than 0","reason":"Invalid","details":{"name":"hive-kcai-dim-gift-product-cat-4d6acdf27cab46edbb7652fc8d224c90","kind":"PodLogOptions","causes":[{"reason":"FieldValueInvalid","message":"Invalid value: -64: must be greater than 0","field":"sinceSeconds"}]},"code":422}\n'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py", line 403, in execute
remote_pod=remote_pod,
File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py", line 426, in cleanup
f'Pod {pod and pod.metadata.name} returned a failure:{error_message}\n{remote_pod}'
airflow.exceptions.AirflowException: Pod hive-kcai-dim-gift-product-cat-4d6acdf27cab46edbb7652fc8d224c90 returned a failure:
None
[2022-06-08, 07:49:40 KST] {taskinstance.py:1401} INFO - Marking task as FAILED. dag_id=common_kudu_to_hdfs_dag, task_id=hive_kcai_dim_gift_product_cate_task, execution_date=20220607T195742, start_date=20220607T224848, end_date=20220607T224940
[2022-06-08, 07:49:40 KST] {standard_task_runner.py:97} ERROR - Failed to execute job 970 for task hive_kcai_dim_gift_product_cate_task (Pod hive-kcai-dim-gift-product-cat-4d6acdf27cab46edbb7652fc8d224c90 returned a failure:
None; 56)
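The 422 comes from the Kubernetes API rejecting a negative `sinceSeconds`: `read_namespaced_pod_log` is called with `since_seconds=-64`, and the API requires a value greater than 0. One plausible way this happens is clock skew between the Airflow worker and the node running the pod, since the value is derived from the timestamp of the last log line already read. A minimal, hypothetical sketch of that mechanism (the helper `since_seconds_for` is made up for illustration and is not the actual Airflow source):

```python
# Hypothetical illustration, not the actual Airflow pod_manager code:
# how clock skew can produce a negative sinceSeconds for
# CoreV1Api.read_namespaced_pod_log.
import math
from datetime import datetime, timedelta, timezone

def since_seconds_for(last_log_time: datetime, now: datetime) -> int:
    # Roughly what happens: sinceSeconds is derived from
    # "worker's now" minus "timestamp of the last log line read".
    return math.ceil((now - last_log_time).total_seconds())

worker_now = datetime(2022, 6, 7, 22, 49, 40, tzinfo=timezone.utc)
# If the node's clock is ~64 seconds ahead of the worker, the last log
# timestamp lies in the worker's "future" and the delta goes negative.
last_log_time = worker_now + timedelta(seconds=64)

print(since_seconds_for(last_log_time, worker_now))  # -64 -> API returns 422

# A defensive fix would clamp the value before calling the API, e.g.:
safe_since_seconds = max(since_seconds_for(last_log_time, worker_now), 1)
```

Synchronizing the node clocks (see the NTP comments below) avoids the negative delta in the first place.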
What you think should happen instead
The task should be marked as SUCCESS, since the pod itself completed successfully; the failure happens only while fetching logs and cleaning up the pod afterwards.
How to reproduce
see above
Operating System
k8s version: v1.17.12
helm chart
apiVersion: v2
name: airflow
version: 1.2.0
appVersion: 2.1.4
Versions of Apache Airflow Providers
apache-airflow-providers-amazon==3.4.0
apache-airflow-providers-celery==2.1.4
apache-airflow-providers-cncf-kubernetes==4.0.2
apache-airflow-providers-docker==2.7.0
apache-airflow-providers-elasticsearch==3.0.3
apache-airflow-providers-ftp==2.1.2
apache-airflow-providers-google==7.0.0
apache-airflow-providers-grpc==2.0.4
apache-airflow-providers-hashicorp==2.2.0
apache-airflow-providers-http==2.1.2
apache-airflow-providers-imap==2.2.3
apache-airflow-providers-microsoft-azure==3.9.0
apache-airflow-providers-mysql==2.2.3
apache-airflow-providers-odbc==2.0.4
apache-airflow-providers-postgres==4.1.0
apache-airflow-providers-redis==2.0.4
apache-airflow-providers-sendgrid==2.0.4
apache-airflow-providers-sftp==2.6.0
apache-airflow-providers-slack==4.2.3
apache-airflow-providers-sqlite==2.1.3
apache-airflow-providers-ssh==2.4.4
Deployment
Official Apache Airflow Helm Chart
Deployment details
No response
Anything else
No response
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project’s Code of Conduct
@takersk so after configuring NTP, the issue was gone. It came back a few days later, and I realized the servers I had added to my cluster were not configured for NTP. So it confirms that the issue for me was related to time synchronization.
@potiuk I had an unrelated warning in the scheduler log (scheduled time in the future) and realized my NTP was not configured; some of the nodes were almost 1 minute apart. I configured NTP and now all my nodes are within a few ms of each other. I'll test and check whether it fixes the issue.
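For anyone checking whether clock skew is the culprit, one way is to query an NTP server from each node and compare offsets. A small sketch, assuming the `ntplib` package is installed and `pool.ntp.org` is reachable (both are assumptions, not part of the original report):

```python
# Minimal sketch: report how far this machine's clock drifts from NTP time.
# Run it on each Kubernetes node / Airflow worker and compare the offsets.
import ntplib  # assumption: `pip install ntplib`

def clock_offset_seconds(server: str = "pool.ntp.org") -> float:
    response = ntplib.NTPClient().request(server, version=3)
    # `offset` is the estimated local-clock error in seconds
    # (positive means the local clock is behind the NTP server).
    return response.offset

if __name__ == "__main__":
    print(f"clock offset vs NTP: {clock_offset_seconds():+.3f}s")
    # Skew of tens of seconds between nodes is enough to make the computed
    # sinceSeconds go negative, as in the 422 above.
```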