ECS operator throws an error when attempting to reattach to ECS tasks
Apache Airflow Provider(s)
amazon
Versions of Apache Airflow Providers
apache-airflow-providers-amazon 3.2.0
Apache Airflow version
2.2.5 (latest released)
Operating System
Linux / ECS
Deployment
Other Docker-based deployment
Deployment details
We are running Docker on OpenShift 4
What happened
There seems to be a bug in the ECS operator code, in the “reattach” flow. We are running into instability issues that cause our Airflow scheduler to restart. When the scheduler restarts while an ECS task is running, the ECS operator tries to reattach to that task once the scheduler is back up. It finds the ECS task and attaches to it without problems, but when it then tries to fetch the logs, it throws the following error:
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1334, in _run_raw_task
    self._execute_task_with_callbacks(context)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1460, in _execute_task_with_callbacks
    result = self._execute_task(context, self.task)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1516, in _execute_task
    result = execute_callable(context=context)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/utils/session.py", line 70, in wrapper
    return func(*args, session=session, **kwargs)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/amazon/aws/operators/ecs.py", line 295, in execute
    self.task_log_fetcher = self._get_task_log_fetcher()
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/amazon/aws/operators/ecs.py", line 417, in _get_task_log_fetcher
    log_stream_name = f"{self.awslogs_stream_prefix}/{self.ecs_task_id}"
AttributeError: 'EcsOperator' object has no attribute 'ecs_task_id'
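From the traceback, the log stream name is built as f"{self.awslogs_stream_prefix}/{self.ecs_task_id}", so the failure suggests that the reattach path never sets ecs_task_id, while the normal start path does. The sketch below illustrates that shape and one possible fix; the method names and boto3 call details are assumptions for illustration, not the actual provider source.

```python
# Illustrative sketch only; not the provider source. Method names and call
# shapes are assumed from the traceback and the boto3 ECS API.

def _start_task(self, context):
    # Normal flow: run_task returns the new task, and the task id is parsed
    # out of its ARN before the log fetcher is created.
    response = self.client.run_task(
        cluster=self.cluster, taskDefinition=self.task_definition
    )
    self.arn = response["tasks"][0]["taskArn"]
    self.ecs_task_id = self.arn.split("/")[-1]  # the only place the attribute is set

def _try_reattach_task(self, context):
    # Reattach flow: the previously started task is looked up again, but only
    # the ARN is stored, so self.ecs_task_id never exists and
    # _get_task_log_fetcher() raises the AttributeError above.
    running = self.client.list_tasks(cluster=self.cluster, desiredStatus="RUNNING")
    self.arn = running["taskArns"][0]  # simplified lookup
    # Possible fix: derive the task id from the ARN here as well
    self.ecs_task_id = self.arn.split("/")[-1]
```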
At this point the operator fails, the Airflow task is marked for retries and eventually marked as failed, while on the ECS side the task keeps running fine. The manual way to fix this is to wait for the ECS task to complete, then mark the Airflow task as successful and trigger the downstream tasks. This is not very practical, since the task can take a long time (in our case, hours).
What you think should happen instead
I expect the ECS operator to be able to reattach and pull the logs as normal.
How to reproduce
Configure a task that runs using the ECS operator and make sure it takes a very long time (a minimal example DAG is sketched below). Start the task and, once the logs start flowing to Airflow, restart the Airflow scheduler. After the scheduler comes back up, check whether the task is able to reattach and fetch the logs on retry.
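For reference, a minimal reproduction DAG could look like the sketch below; the cluster, task definition, and log group names are placeholders, and it assumes the referenced task definition runs a container that stays up for a long time.

```python
# Minimal reproduction sketch; all AWS resource names are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.ecs import EcsOperator

with DAG(
    dag_id="ecs_reattach_repro",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    long_running = EcsOperator(
        task_id="long_running_ecs_task",
        cluster="my-cluster",                      # placeholder ECS cluster
        task_definition="my-long-task",            # task definition that runs for hours
        launch_type="FARGATE",
        overrides={"containerOverrides": []},
        awslogs_group="/ecs/my-long-task",         # CloudWatch log group used by the task
        awslogs_stream_prefix="ecs/my-long-task",  # the task id gets appended to this prefix
        reattach=True,                             # enables the flow that hits the error
    )
```

Restarting the scheduler while this task is streaming logs should then reproduce the AttributeError shown above.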
Anything else
When Airflow restarts, it tries to kill the task at hand. In our case we did not give the AWS role permission to stop running ECS tasks, so the ECS tasks keep running while Airflow restarts. Others might not have this setup, so they will not go through the “reattach” flow and will not encounter the issue reported here. Letting Airflow kill the running ECS tasks is not a good option for us, since our tasks can take hours to complete and we do not want to interfere with their execution.
We also need to improve the stability of the OpenShift infrastructure where Airflow runs, so that the scheduler doesn’t restart so often, but that is a different story.
Are you willing to submit PR?
- Yes, I am willing to submit a PR!
 
Code of Conduct
- I agree to follow this project’s Code of Conduct
 
@o-nikolas: I tried to create a PR with the code change and possible unit tests at https://github.com/apache/airflow/pull/22879
Thanks for flagging this one @potiuk
And thanks @fshehadeh for catching this issue. It is definitely a bug 😄
It seems like a quick fix (I’ve already coded it) but it will take some time to test it before pushing out a PR.