Tasks stuck indefinitely when following container logs
Apache Airflow version
2.2.4
What happened
I observed that some workers hung at random after running for a while, and their logs stopped being reported. After some time, the pod status was “Completed” when inspected through the k8s API, but Airflow still showed “status:running” for the pod. After some investigation, I found that the issue is in the new Kubernetes pod operator and depends on an open issue in the Kubernetes API.
When a log rotation event occurs in Kubernetes, the stream we consume in fetch_container_logs(follow=True,…) is no longer fed.
Therefore, the k8s pod operator hangs indefinitely in the middle of the log. Only a SIGTERM can terminate it, because log consumption blocks execute() from finishing.
Ref to the issue in kubernetes: https://github.com/kubernetes/kubernetes/issues/59902
Linking to https://github.com/apache/airflow/issues/12103 for reference, as the result is more or less the same for the end user (although the root cause is different).
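To illustrate the failure mode described above, here is a minimal standard-library simulation. It is not Airflow or Kubernetes client code: the queue stands in for the followed log stream, `follow` is a hypothetical consumer, and the "rotation" is simply a producer that stops writing without ever signalling end-of-stream.

```python
import queue
import threading


def follow(stream: queue.Queue, sink: list) -> None:
    # Hypothetical consumer: blocks on every get() and returns only when
    # the producer sends the None sentinel (our stand-in for EOF).
    while (line := stream.get()) is not None:
        sink.append(line)


stream: queue.Queue = queue.Queue()
received: list = []
consumer = threading.Thread(target=follow, args=(stream, received), daemon=True)
consumer.start()

stream.put("line 1")
stream.put("line 2")
# Simulated log rotation: the producer stops feeding the stream here
# and never sends the EOF sentinel.

consumer.join(timeout=0.5)           # give the consumer time to drain the queue
still_blocked = consumer.is_alive()  # the consumer is stuck inside get()
print(received, still_blocked)
```

The consumer has read everything that was written but can never return, which is the position execute() is in when it waits on the followed log stream.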
What you think should happen instead
The pod operator should not hang. It could also follow the new logs from the container, but this is out of scope for Airflow, as ideally the k8s API would do it automatically.
Solution proposal
I think there are many ways to work around this on the Airflow side so that it does not hang indefinitely (for example, making fetch_container_logs non-blocking for execute and instead always blocking until status.phase.completed, as is currently done when get_logs is not true).
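One possible shape for this proposal, as a rough sketch only: follow the logs in a background thread and let the caller block on the pod phase instead of on the log stream. Everything here is an assumption for illustration. `wait_for_pod`, `get_phase`, `log_stream` and `on_line` are hypothetical names, not part of the Airflow provider, and the stalled generator merely simulates a stream that stops being fed. The terminal phases "Succeeded" and "Failed" are the standard Kubernetes pod phases.

```python
import threading
import time
from typing import Callable, Iterable, Iterator, List

# Kubernetes pod phases after which the pod will not run again.
TERMINAL_PHASES = {"Succeeded", "Failed"}


def wait_for_pod(
    get_phase: Callable[[], str],    # hypothetical: returns the pod's status.phase
    log_stream: Iterable[str],       # hypothetical: blocking iterator of log lines
    on_line: Callable[[str], None],  # callback invoked for each log line
    poll_interval: float = 0.05,
    drain_timeout: float = 0.2,
) -> str:
    def _follow() -> None:
        for line in log_stream:
            on_line(line)

    # Daemon thread: a stalled stream can no longer block the caller.
    follower = threading.Thread(target=_follow, daemon=True)
    follower.start()

    # Completion is decided by the pod phase, not by the end of the log stream.
    phase = get_phase()
    while phase not in TERMINAL_PHASES:
        time.sleep(poll_interval)
        phase = get_phase()

    # Give the follower a bounded amount of time to flush remaining lines.
    follower.join(timeout=drain_timeout)
    return phase


def stalled_stream(never_set: threading.Event) -> Iterator[str]:
    yield "line 1"
    yield "line 2"
    never_set.wait()  # simulates a stream that stops being fed without EOF


phases = iter(["Running", "Running", "Succeeded"])
collected: List[str] = []
final_phase = wait_for_pod(
    get_phase=lambda: next(phases),
    log_stream=stalled_stream(threading.Event()),
    on_line=collected.append,
)
print(final_phase, collected)
```

In this sketch, any log lines written after the stream stalls are lost. A fuller workaround would probably need to re-open the log stream or fetch the remaining logs once the pod has finished.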
How to reproduce
Running multiple tasks will sooner or later trigger this. Also, one can configure more aggressive log rotation in k8s so that this race is triggered more often.
Operating System
Debian GNU/Linux 11 (bullseye)
Versions of Apache Airflow Providers
apache-airflow==2.2.4
apache-airflow-providers-google==6.4.0
apache-airflow-providers-cncf-kubernetes==3.0.2
However, this should be reproducible with master.
Deployment
Official Apache Airflow Helm Chart
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project’s Code of Conduct
Top GitHub Comments
Cool. Assigned you 😃 !
@potiuk sure, I will submit one one of these days.