Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Task stuck in "scheduled" or "queued" state, pool has all slots queued, nothing is executing

See original GitHub issue

Apache Airflow version: 2.0.0

Kubernetes version (if you are using kubernetes) (use kubectl version):

Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.3", GitCommit:"1e11e4a2108024935ecfcb2912226cedeafd99df", GitTreeState:"clean", BuildDate:"2020-10-14T12:50:19Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"17+", GitVersion:"v1.17.14-gke.1600", GitCommit:"7c407f5cc8632f9af5a2657f220963aa7f1c46e7", GitTreeState:"clean", BuildDate:"2020-12-07T09:22:27Z", GoVersion:"go1.13.15b4", Compiler:"gc", Platform:"linux/amd64"}

Environment:

Cloud provider or hardware configuration: GKE
OS (e.g. from /etc/os-release):
Kernel (e.g. uname -a):
Install tools:
Others:
- Airflow metadata database is hooked up to a PostgreSQL instance

What happened:

Airflow 2.0.0 running on the KubernetesExecutor has many tasks stuck in “scheduled” or “queued” state which never get resolved.
The setup has a default_pool of 16 slots.
Currently no slots are used (see Screenshot), but all slots are queued.
No work is executed any more. The Executor or Scheduler is stuck.
There are many many tasks stuck in “scheduled” state
- Tasks in “scheduled” state say ('Not scheduling since there are %s open slots in pool %s and require %s pool slots', 0, 'default_pool', 1) That is simply not true, because there is nothing running on the cluster and there are always 16 tasks stuck in “queued”.
There are many tasks stuck in “queued” state
- Tasks in “queued” state say Task is in the 'running' state which is not a valid state for execution. The task must be cleared in order to be run. That is also not true. Nothing is running on the cluster and Airflow is likely just lying to itself. It seems the KubernetesExecutor and the scheduler easily go out of sync.

What you expected to happen:

Airflow should resolve scheduled or queued tasks by itself once the pool has available slots
Airflow should use all available slots in the pool
It should be possible to clear a couple hundred tasks and expect the system to stay consistent

How to reproduce it:

Vanilla Airflow 2.0.0 with KubernetesExecutor on Python 3.7.9

requirements.txt

pyodbc==4.0.30
pycryptodomex==3.9.9
apache-airflow-providers-google==1.0.0
apache-airflow-providers-odbc==1.0.0
apache-airflow-providers-postgres==1.0.0
apache-airflow-providers-cncf-kubernetes==1.0.0
apache-airflow-providers-sftp==1.0.0
apache-airflow-providers-ssh==1.0.0

The only reliable way to trigger that weird bug is to clear the task state of many tasks at once. (> 300 tasks)

Anything else we need to know:

Don’t know, as always I am happy to help debug this problem. The scheduler/executer seems to go out of sync and never back in sync again with the state of the world.

We actually planned to upscale our Airflow installation with many more simultaneous tasks. With these severe yet basic scheduling/queuing problems we cannot move forward at all.

Another strange, likely unrelated observation, the scheduler always uses 100% of the CPU. Burning it. Even with no scheduled or now queued tasks, its always very very busy.

Workaround:

The only workaround for this problem I could find so far, is to manually go in, find all tasks in “queued” state and clear them all at once. Without that, the whole cluster/Airflow just stays stuck like it is.

Issue Analytics

State:
Created 3 years ago
Reactions:4
Comments:145 (66 by maintainers)

Top GitHub Comments

42reactions

nikitarulzcommented, Mar 12, 2021

This is happening with Celery executor as well. I’m using Airflow 2.0.0 with Celery executor and mysql, facing similar issue. Sorry for the basic question but I’m unable to figure-out the manual way to find all tasks in “queued” state and clearing them. Can somebody help here.

14reactions

Jorrickscommented, Jun 15, 2021

We were running into a similar issue as we have 100+ dags and around a 1000 tasks.

I figured out there is a bug in the celery_executor which I still want to fix myself and contribute.

Summary of that problem: At the start of the scheduler, the celery_executor class instance of the scheduler picks up everything from ‘dead’ schedulers (your previous run). That is (if you run one scheduler) every TaskInstance in the Running, Queued or Scheduled state. Then once it verified that this task is not running(takes 10 minutes), it clears most of the references but forgets a crucial one, making it such that the scheduler can NEVER start this task anymore. You can still start it via the webserver because that has its own celery_executor class instance.

What we noticed:

Many tasks were very slowly to be scheduled even though the workers were almost fully idle.
The TaskInstances were stuck on Queued or Scheduled.
Restarting the scheduler didn’t work.
Once restarted (with debug logging enabled) you’d get a logging line indicating you have negative open slots: [2021-06-14 14:07:31,932] {base_executor.py:152} DEBUG - -62 open slots

What you can do to verify whether you have the same issue:

Stop the scheduler
Clear all TaskInstances that are Queued or Scheduled
Start the scheduler

Our fix:

Increase the airflow.cfg parallelism -> from 32 to 1000. This is what could easily deadlock your scheduler after a restart. Because it uses this variable to see if it can launch any new task. If you had 50 tasks in Scheduled waiting, it will deadlock your entire scheduler.
Increase the default pool size (for a speedup) -> from 128 to 1000
For any task that the scheduler can’t run anymore. Do the procedure mentioned above or kick-start it yourself by clicking the task instance followed by “Ignore all deps”, “Ignore Task states”, “Ignore Task Deps” and finally “Run”.

Hope this helps anyone and saves you a couple days of debugging 😃

Top Results From Across the Web

Airflow 1.9.0 is queuing but not launching tasks - Stack Overflow

What I found out is that airflow scheduler mainly responsible for putting your scheduled tasks in to "Queued Slots" (pool), while airflow celery ......

[GitHub] [airflow] mongakshay removed a comment on issue ...

... removed a comment on issue #13542: Task stuck in "scheduled" or "queued" state, pool has all slots queued, nothing is executing.

Task Is Not Being Scheduled As Expected In Airflow - ADocLib

[GitHub] [airflow] zachliu commented on issue #13542: Task stuck in scheduled or queued state pool has all slots queued nothing is executing GitBox...

Performance tuning for Apache Airflow on Amazon MWAA

The number of processes the Celery Executor uses to sync task state. ... finding and queuing tasks, and executing queued tasks in the...

7 Common Errors to Check When Debugging Airflow DAGs

You wrote a new DAG that needs to run every hour and you're ready to ... isn't scheduling tasks or running at all,...