Improvements to the "Executor reports task instance finished (failed) although the task says its queued" exception messages
Description
One day our production Airflow (1.10.9) environment started to raise the following exception for multiple tasks in different DAGs:
Try 0 out of 1
Exception:
Executor reports task instance finished (failed) although the task says its queued. Was the task killed externally?
If the tasks were allowed to retry, they would eventually succeed. However, this increased the execution time of many of our ETLs by roughly 400%.
Later, thanks to an obscure Stack Overflow answer, we discovered that this was happening because the Airflow database was overloaded. The Airflow Scheduler probably started to receive multiple timeouts and assumed that healthy tasks had been killed, or something similar. We moved the database to a larger instance and the exceptions stopped, without any other change.
Use case / motivation
Firstly, the start of the error message, Try 0 out of 1, doesn't seem natural. The task didn't even run; I believe the message should flag that, showing something like Try 0 out of 1 (failed to start the task).
The rest of the exception message is pretty unhelpful. I am far from an Airflow expert, but I believe the message could be more specific and change according to the exception context. Here are some ideas based on problems that could happen:
- Executor reports task instance finished (failed) although the task says its queued. Was the task killed externally?
- Executor reports task instance finished (failed) although the task says its queued. Is the Scheduler task healthy?
- Executor reports task instance finished (failed) although the task says its queued. Is the Airflow connection with the Database OK?
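A minimal sketch of what such context-aware messages might look like. Everything here is hypothetical: the `FailureContext` enum and `build_message` function are illustrations of the idea, not part of Airflow's code or API:

```python
# Hypothetical sketch: map a suspected failure context to a more
# specific hint appended to the existing executor error message.
# None of these names exist in Airflow; they only illustrate the idea.
from enum import Enum, auto


class FailureContext(Enum):
    UNKNOWN = auto()
    SCHEDULER_UNHEALTHY = auto()
    DB_CONNECTION_ISSUE = auto()


# One hint per suspected cause, instead of a single catch-all question.
HINTS = {
    FailureContext.UNKNOWN: "Was the task killed externally?",
    FailureContext.SCHEDULER_UNHEALTHY: "Is the Scheduler healthy?",
    FailureContext.DB_CONNECTION_ISSUE: (
        "Is the Airflow connection with the database OK?"
    ),
}


def build_message(context: FailureContext) -> str:
    """Build the executor error message with a context-specific hint."""
    base = (
        "Executor reports task instance finished (failed) "
        "although the task says its queued."
    )
    return f"{base} {HINTS.get(context, HINTS[FailureContext.UNKNOWN])}"
```

For example, `build_message(FailureContext.DB_CONNECTION_ISSUE)` would have pointed us at the overloaded database much sooner than the generic "killed externally" question did.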
That is it. If possible, small changes like these would spare other people the problems I ran into these days.
Thanks!
Issue Analytics
- State:
- Created 3 years ago
- Reactions: 8
- Comments: 7 (3 by maintainers)
@ephraimbuddy Here is another case where Celery can fail to even run the task
Hello, we changed the RDS instance type from a t3.micro to a t3.small. Check the CPU credit balance on the database's Monitoring tab to see if it is zeroed (meaning the instance is overloaded). That is how we noticed the database was the cause of this issue.
Good luck!
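For anyone who wants to check this programmatically rather than in the AWS console, a small sketch of the interpretation step. The CloudWatch metric `CPUCreditBalance` is real for burstable (t2/t3) RDS instances; the fetching call is left as a comment because it needs AWS credentials, and the `threshold` value here is an assumption, not an official cutoff:

```python
# Sketch: decide whether a burstable RDS instance looks credit-starved
# from a series of CloudWatch CPUCreditBalance datapoints.
#
# Fetching the datapoints is environment-specific; with boto3 you would
# call cloudwatch.get_metric_statistics(Namespace="AWS/RDS",
# MetricName="CPUCreditBalance", ...) -- omitted here to keep the
# example self-contained.

def is_credit_starved(balances, threshold=5.0):
    """Return True if the credit balance dropped to (near) zero,
    which matches the overload described in the comment above.

    `threshold` is a hypothetical safety margin, not an AWS value.
    """
    return bool(balances) and min(balances) <= threshold
```

For example, a balance series like `[120.0, 40.0, 0.0]` (credits draining to zero) would be flagged, while a steady `[140.0, 138.0]` would not.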