Worker never running tasks or failing them with no explanation for many simultaneous tasks
Apache Airflow version: 2.0.0rc1
Kubernetes version (if you are using kubernetes) (use kubectl version): 1.19.4
Environment:
- Cloud provider or hardware configuration: Laptop with 6 cores and 32GB RAM
- OS (e.g. from /etc/os-release): Ubuntu 20.04.1 LTS
- Kernel (e.g. uname -a): 5.4.0-56-generic
- Install tools:
- Others:
What happened: I am running the 2.0.0 release candidate in minikube using the Celery executor. It was installed using the Helm chart in git, with the executor changed and a persistent volume claim added for storing DAGs. ‘workers.replicas’ is set to 2. I’m testing different scaling options by launching large numbers of tasks and evaluating how quickly and consistently they run. The DAG is run manually through the web server, and on most runs either some of the tasks fail with no explanation or some tasks are left in the ‘queued’ state and never run. The tasks in the ‘queued’ state are shown as ‘active’ in the Flower dashboard but do not appear to be actually running.
As part of my testing I have increased the values of AIRFLOW__CORE__DAG_CONCURRENCY and AIRFLOW__CELERY__WORKER_CONCURRENCY. This seems like it might exacerbate the problem, but I have also reproduced it with the default settings.
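For readers unfamiliar with the two settings named above: Airflow treats environment variables of the form AIRFLOW__{SECTION}__{KEY} as overrides for the matching configuration option. The sketch below is not Airflow's own parser. It is a minimal illustration of which section and key each variable targets, and the function name split_airflow_env_var is made up for this example.

```python
from typing import Optional, Tuple

PREFIX = "AIRFLOW__"


def split_airflow_env_var(name: str) -> Optional[Tuple[str, str]]:
    """Return (section, key) for an AIRFLOW__{SECTION}__{KEY} name, else None.

    Hypothetical helper for illustration only; not part of Airflow.
    """
    if not name.startswith(PREFIX):
        return None
    # The first double underscore after the prefix separates section from key.
    section, sep, key = name[len(PREFIX):].partition("__")
    if not sep or not section or not key:
        return None
    return section.lower(), key.lower()


for var in ("AIRFLOW__CORE__DAG_CONCURRENCY", "AIRFLOW__CELERY__WORKER_CONCURRENCY"):
    print(var, "->", split_airflow_env_var(var))
```

In other words, the two variables correspond to dag_concurrency in the [core] section and worker_concurrency in the [celery] section of airflow.cfg.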
What you expected to happen: All tasks run successfully.
What do you think went wrong? Initially I thought I was over-taxing the system, but resource monitoring has shown nothing indicating this. My system has 11 GB of RAM free and 4 CPUs, and CPU utilization never went over 30%.
How to reproduce it: Attached is a simple DAG that produces the issue on my setup. concurrent_workflow.zip
Anything else we need to know: I haven’t seen anything indicating an error in the logs, but I would be happy to provide them if requested.
How often does this problem occur? Once? Every time etc? The majority of my runs (75-90%) have resulted in between 1 and 4 tasks stuck in the ‘queued’ state. The failed tasks are less frequent (approximately 25%).
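One way to make frequency estimates like these reproducible is to tally task-instance states per DAG run. The sketch below assumes the states have already been collected (from the UI, the CLI, or the REST API) into one list of dicts per run. The function names and the sample data are invented for illustration and are not measurements from this issue.

```python
from collections import Counter
from typing import Dict, List

TaskInstance = Dict[str, str]


def summarize_run(task_instances: List[TaskInstance]) -> Counter:
    """Count task instances in a single DAG run by their state."""
    return Counter(ti["state"] for ti in task_instances)


def fraction_of_runs_with(state: str, runs: List[List[TaskInstance]]) -> float:
    """Fraction of runs that contain at least one task in the given state."""
    if not runs:
        return 0.0
    hits = sum(1 for run in runs if summarize_run(run)[state] > 0)
    return hits / len(runs)


# Hypothetical sample: four runs of a three-task DAG.
sample_runs = [
    [{"task_id": "t1", "state": "success"}, {"task_id": "t2", "state": "queued"}, {"task_id": "t3", "state": "success"}],
    [{"task_id": "t1", "state": "success"}, {"task_id": "t2", "state": "success"}, {"task_id": "t3", "state": "success"}],
    [{"task_id": "t1", "state": "failed"}, {"task_id": "t2", "state": "queued"}, {"task_id": "t3", "state": "success"}],
    [{"task_id": "t1", "state": "queued"}, {"task_id": "t2", "state": "success"}, {"task_id": "t3", "state": "success"}],
]

print("runs with queued tasks:", fraction_of_runs_with("queued", sample_runs))
print("runs with failed tasks:", fraction_of_runs_with("failed", sample_runs))
```

Counting per run rather than per task matches how the percentages above are phrased, as the share of runs affected.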
Thank you – I won’t be able to try these out until Monday but I’ll let you know what I find.
This issue was reported against an older version of Airflow. Please check with the latest Airflow version.