
Pod logs from KubernetesPodOperator occasionally get replaced with "Task is not able to run"

See original GitHub issue

Apache Airflow version: 1.10.10
Kubernetes version (if you are using kubernetes) (use kubectl version): 1.17.2

Environment:

  • Cloud provider or hardware configuration: AWS
  • OS (e.g. from /etc/os-release): CentOS Linux
  • Kernel (e.g. uname -a): Linux airflow-worker-0 5.6.13-1.el7.elrepo.x86_64 #1 SMP Thu May 14 08:05:24 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools:
  • Others:

What happened:

I run Airflow to orchestrate jobs on Kubernetes using the KubernetesPodOperator. While logs usually appear correctly in the Airflow webserver, I increasingly notice that they do not appear and are instead replaced with the message “Task is not able to be run”, as in the snippet below:

*** Reading remote log from s3://*****/****/******/2020-06-30T10:00:00+00:00/1.log.
[2020-06-30 23:07:40,362] {taskinstance.py:663} INFO - Dependencies not met for <TaskInstance: ****.***** 2020-06-30T10:00:00+00:00 [running]>, dependency 'Task Instance State' FAILED: Task is in the 'running' state which is not a valid state for execution. The task must be cleared in order to be run.
[2020-06-30 23:07:40,363] {logging_mixin.py:112} INFO - [2020-06-30 23:07:40,363] {local_task_job.py:91} INFO - Task is not able to be run

Curiously, when I check what is happening on the Kubernetes cluster, the pod is actually running and emitting logs when queried with a kubectl logs command (see the example below). When the pod completes, Airflow reflects that the task has completed as well.
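For reference, the check described above looks like this; the pod name and namespace are placeholders, not taken from the issue:

# Placeholder pod name and namespace
kubectl logs -f my-task-pod -n airflow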

What you expected to happen: I expected pod logs to be printed out.

How to reproduce it: Unfortunately, I am unsure what circumstances cause this error and am currently gathering evidence to replicate it.

Anything else we need to know:

  • I have remote logging set to an S3 bucket (the relevant settings are sketched after this list).
  • I’ve noticed this issue more frequently since the 1.10.10 update, and I now see this error daily.
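A minimal sketch of the remote logging configuration involved, for Airflow 1.10.x; the bucket path and connection id here are placeholders, not taken from the issue:

[core]
# Ship task logs to S3 once a task finishes (placeholder bucket and conn id)
remote_logging = True
remote_base_log_folder = s3://my-airflow-logs/logs
remote_log_conn_id = aws_default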

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Reactions: 13
  • Comments: 45 (24 by maintainers)

Top GitHub Comments

9 reactions
raj-manvar commented, Sep 10, 2020

A few new findings:

  • I was able to reproduce this issue for tasks running longer than a minute by setting visibility_timeout = 60 under [celery_broker_transport_options] in the airflow.cfg file (see the snippet after this list).

  • This happens because Celery expects the task to complete within the visibility timeout (one hour by default) and, if it does not, delivers the task to another worker. During this transition, the new worker uploads a log containing “Task is not able to be run” to S3, replacing the pod logs.

  • I can see another worker receiving the same task in the logs: Received task: airflow.executors.celery_executor.execute_command[b40cacbb-9dd3-4681-8454-0e1df2dbc910] with the same id, confirming that Celery is assigning this task to another worker.

  • Setting visibility_timeout = 86400 (1 day) in airflow.cfg does not resolve the issue; the logs in the UI are still corrupted after an hour.

  • I even tried visibility_timeout = 7200 (2 hours) in airflow.cfg but can still see the issue after an hour.

  • The issue seems similar to https://github.com/celery/celery/issues/5935, which according to that thread should be resolved in Celery 4.4.5, but we still see the same behavior even though Airflow 1.10.10 uses Celery 4.4.6.
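For reference, the reproduction described in the first bullet amounts to this airflow.cfg change (a 60-second visibility timeout, so any task running longer than a minute is re-delivered):

[celery_broker_transport_options]
# Celery re-delivers any task not acknowledged within this many seconds
visibility_timeout = 60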

( CC: @chrismclennon @dimberman )

8 reactions
Pseverin commented, Jul 24, 2020

I’m receiving the same message when running a long task (duration of more than 1 day) with KubernetesPodOperator.

{taskinstance.py:624} INFO - Dependencies not met for <TaskInstance: **** 2020-07-23T11:23:35+00:00 [running]>, dependency 'Task Instance State' FAILED: Task is in the 'running' state which is not a valid state for execution. The task must be cleared in order to be run.

The task finishes successfully, but its Airflow status is marked as FAILED and downstream tasks are not run. A sketch of this kind of long-running pod task follows.
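Below is a minimal sketch of the kind of DAG that triggers this, assuming Airflow 1.10.x with the contrib KubernetesPodOperator; the DAG id, image, namespace, and sleep duration are illustrative, not taken from the comments:

# Illustrative repro: a pod task that runs longer than Celery's visibility
# timeout (all names here are hypothetical).
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

with DAG(
    dag_id="long_pod_repro",
    start_date=datetime(2020, 6, 1),
    schedule_interval=None,
) as dag:
    long_task = KubernetesPodOperator(
        task_id="long_sleep",
        name="long-sleep",
        namespace="default",
        image="busybox",
        cmds=["sh", "-c"],
        arguments=["sleep 100000"],  # just over a day, matching the report above
        in_cluster=True,
        get_logs=True,  # stream pod logs back into the Airflow task log
    )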

Read more comments on GitHub

Top Results From Across the Web

Airflow on kubernetes cannot fetch logs - Stack Overflow
I'm running a task using a KubernetesPodOperator, with in_cluster=True parameters, and it runs well, I can even kubectl logs pod-name and all ...

airflow.providers.cncf.kubernetes.utils.pod_manager
Source code for airflow.providers.cncf.kubernetes.utils.pod_manager ... For a long-running container, sometimes the log read may be interrupted Such errors ...

Debug Running Pods | Kubernetes
This page explains how to debug Pods running (or crashing) on a Node. Before you begin Your Pod should already be scheduled and...

Data Infrastructure - GitLab
All DAGs are created using the KubernetesPodOperator so while working from local we need a cluster where we should be able to spin...

kubernetes pod operator throwing error when task succeeds
I am using the kubernetes pod operator to run a container on the kubernetes ... using resources on the cluster until eventually no...
