Intermittent failures when deleting pods

Apache Airflow version

2.3.1

What happened

Intermittent error when deleting pods after pod state=SUCCEEDED

[2022-06-08, 07:49:40 KST] {kubernetes_pod.py:434} INFO - Deleting pod: hive-kcai-dim-gift-product-cat-4d6acdf27cab46edbb7652fc8d224c90
[2022-06-08, 07:49:40 KST] {taskinstance.py:1890} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py", line 390, in execute
    follow=True,
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/cncf/kubernetes/utils/pod_manager.py", line 245, in fetch_container_logs
    last_log_time = consume_logs(since_time=last_log_time, follow=follow)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/cncf/kubernetes/utils/pod_manager.py", line 221, in consume_logs
    follow=follow,
  File "/home/airflow/.local/lib/python3.7/site-packages/tenacity/__init__.py", line 324, in wrapped_f
    return self(f, *args, **kw)
  File "/home/airflow/.local/lib/python3.7/site-packages/tenacity/__init__.py", line 404, in __call__
    do = self.iter(retry_state=retry_state)
  File "/home/airflow/.local/lib/python3.7/site-packages/tenacity/__init__.py", line 360, in iter
    raise retry_exc.reraise()
  File "/home/airflow/.local/lib/python3.7/site-packages/tenacity/__init__.py", line 193, in reraise
    raise self.last_attempt.result()
  File "/usr/local/lib/python3.7/concurrent/futures/_base.py", line 428, in result
    return self.__get_result()
  File "/usr/local/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/home/airflow/.local/lib/python3.7/site-packages/tenacity/__init__.py", line 407, in __call__
    result = fn(*args, **kwargs)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/cncf/kubernetes/utils/pod_manager.py", line 332, in read_pod_logs
    **additional_kwargs,
  File "/home/airflow/.local/lib/python3.7/site-packages/kubernetes/client/api/core_v1_api.py", line 23747, in read_namespaced_pod_log
    return self.read_namespaced_pod_log_with_http_info(name, namespace, **kwargs)  # noqa: E501
  File "/home/airflow/.local/lib/python3.7/site-packages/kubernetes/client/api/core_v1_api.py", line 23880, in read_namespaced_pod_log_with_http_info
    collection_formats=collection_formats)
  File "/home/airflow/.local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 353, in call_api
    _preload_content, _request_timeout, _host)
  File "/home/airflow/.local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 184, in __call_api
    _request_timeout=_request_timeout)
  File "/home/airflow/.local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 377, in request
    headers=headers)
  File "/home/airflow/.local/lib/python3.7/site-packages/kubernetes/client/rest.py", line 244, in GET
    query_params=query_params)
  File "/home/airflow/.local/lib/python3.7/site-packages/kubernetes/client/rest.py", line 234, in request
    raise ApiException(http_resp=r)
kubernetes.client.exceptions.ApiException: (422)
Reason: Unprocessable Entity
HTTP response headers: HTTPHeaderDict({'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Tue, 07 Jun 2022 22:49:40 GMT', 'Content-Length': '490'})
HTTP response body: b'{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"PodLogOptions \\"hive-kcai-dim-gift-product-cat-4d6acdf27cab46edbb7652fc8d224c90\\" is invalid: sinceSeconds: Invalid value: -64: must be greater than 0","reason":"Invalid","details":{"name":"hive-kcai-dim-gift-product-cat-4d6acdf27cab46edbb7652fc8d224c90","kind":"PodLogOptions","causes":[{"reason":"FieldValueInvalid","message":"Invalid value: -64: must be greater than 0","field":"sinceSeconds"}]},"code":422}\n'


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py", line 403, in execute
    remote_pod=remote_pod,
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py", line 426, in cleanup
    f'Pod {pod and pod.metadata.name} returned a failure:{error_message}\n{remote_pod}'
airflow.exceptions.AirflowException: Pod hive-kcai-dim-gift-product-cat-4d6acdf27cab46edbb7652fc8d224c90 returned a failure:
None
[2022-06-08, 07:49:40 KST] {taskinstance.py:1401} INFO - Marking task as FAILED. dag_id=common_kudu_to_hdfs_dag, task_id=hive_kcai_dim_gift_product_cate_task, execution_date=20220607T195742, start_date=20220607T224848, end_date=20220607T224940
[2022-06-08, 07:49:40 KST] {standard_task_runner.py:97} ERROR - Failed to execute job 970 for task hive_kcai_dim_gift_product_cate_task (Pod hive-kcai-dim-gift-product-cat-4d6acdf27cab46edbb7652fc8d224c90 returned a failure:
None; 56)
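
The 422 response body in the first traceback already names the rejected field. As a minimal sketch (not part of Airflow or the Kubernetes client), the snippet below parses a body of that shape with the standard library and lists the invalid fields. The helper name `invalid_fields` and the pod name `example-pod` are hypothetical; the body is a shortened copy of the one in the log.

```python
import json

# Shortened copy of the Status body from the log above; "example-pod" is a
# placeholder for the real pod name.
body = (
    b'{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure",'
    b'"message":"PodLogOptions \\"example-pod\\" is invalid: sinceSeconds: '
    b'Invalid value: -64: must be greater than 0","reason":"Invalid",'
    b'"details":{"name":"example-pod","kind":"PodLogOptions","causes":'
    b'[{"reason":"FieldValueInvalid","message":"Invalid value: -64: must be '
    b'greater than 0","field":"sinceSeconds"}]},"code":422}\n'
)


def invalid_fields(raw):
    """Return (field, message) pairs from a Kubernetes Status error body."""
    status = json.loads(raw)
    causes = status.get("details", {}).get("causes", [])
    return [(cause.get("field"), cause.get("message")) for cause in causes]


fields = invalid_fields(body)
for field, message in fields:
    print(f"{field}: {message}")
```

Here the only cause reported is `sinceSeconds` with the value -64, which is the detail the comments below trace back to clock differences between nodes.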

What you think should happen instead

Marking task as SUCCESS

How to reproduce

see above

Operating System

k8s version : v1.17.12

helm chart

apiVersion: v2
name: airflow
version: 1.2.0
appVersion: 2.1.4

Versions of Apache Airflow Providers

apache-airflow-providers-amazon==3.4.0
apache-airflow-providers-celery==2.1.4
apache-airflow-providers-cncf-kubernetes==4.0.2
apache-airflow-providers-docker==2.7.0
apache-airflow-providers-elasticsearch==3.0.3
apache-airflow-providers-ftp==2.1.2
apache-airflow-providers-google==7.0.0
apache-airflow-providers-grpc==2.0.4
apache-airflow-providers-hashicorp==2.2.0
apache-airflow-providers-http==2.1.2
apache-airflow-providers-imap==2.2.3
apache-airflow-providers-microsoft-azure==3.9.0
apache-airflow-providers-mysql==2.2.3
apache-airflow-providers-odbc==2.0.4
apache-airflow-providers-postgres==4.1.0
apache-airflow-providers-redis==2.0.4
apache-airflow-providers-sendgrid==2.0.4
apache-airflow-providers-sftp==2.6.0
apache-airflow-providers-slack==4.2.3
apache-airflow-providers-sqlite==2.1.3
apache-airflow-providers-ssh==2.4.4

Deployment

Official Apache Airflow Helm Chart

Deployment details

No response

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Issue Analytics

  • State: closed
  • Created a year ago
  • Reactions: 1
  • Comments: 12 (3 by maintainers)

Top GitHub Comments

1 reaction
jjournet commented, Aug 29, 2022

@takersk After I configured NTP, the issue went away. It came back a few days later, and I realized that the servers I had since added to my cluster were not configured for NTP. That confirms that, for me, the issue was caused by clock synchronization.

1 reaction
jjournet commented, Jul 1, 2022

@potiuk I had an unrelated warning in the scheduler log (scheduled time in the future) and realized that NTP was not configured: there was almost a 1-minute difference between some of the nodes. I configured NTP, and now all my nodes are within a few ms of each other. I’ll test and check whether that fixes the issue.
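
To illustrate how clock skew could produce the negative `sinceSeconds` in the traceback, here is a minimal sketch. It assumes (this is not confirmed by the issue) that the value is derived by subtracting the timestamp of the last log line, stamped by the node running the pod, from the worker's local clock. The function names and the clamp are hypothetical illustrations, not the provider's code or its actual fix.

```python
import math
from datetime import datetime, timedelta, timezone


def naive_since_seconds(last_log_time, now):
    """Seconds elapsed since the last log line, rounded up (may be negative)."""
    return math.ceil((now - last_log_time).total_seconds())


def clamped_since_seconds(last_log_time, now):
    """Same calculation, but never below 1, the minimum the API accepts."""
    return max(naive_since_seconds(last_log_time, now), 1)


# Worker clock at the time of the failed request in the log above.
worker_now = datetime(2022, 6, 7, 22, 49, 40, tzinfo=timezone.utc)
# Assumed scenario: the pod's node clock runs 64 seconds ahead of the worker,
# so the last log timestamp appears to lie in the future.
last_log_time = worker_now + timedelta(seconds=64)

skewed = naive_since_seconds(last_log_time, worker_now)
safe = clamped_since_seconds(last_log_time, worker_now)
print(skewed, safe)
```

Under these assumptions the unclamped value is -64, matching the rejected value in the 422 response. Clamping only hides the symptom on the client side; synchronizing the node clocks with NTP, as described in the comments, addresses the cause.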
