ECS operator throws an error on attempting to reattach to ECS tasks

Apache Airflow Provider(s)

amazon

Versions of Apache Airflow Providers

apache-airflow-providers-amazon 3.2.0

Apache Airflow version

2.2.5 (latest released)

Operating System

Linux / ECS

Deployment

Other Docker-based deployment

Deployment details

We are running Docker on OpenShift 4.

What happened

There seems to be a bug in the code for the ECS operator, during the “reattach” flow. We are running into some instability issues that cause our Airflow scheduler to restart. When the scheduler restarts while a task is running using ECS, the ECS operator will try to reattach to the ECS task once the Airflow scheduler restarts. The code works fine finding the ECS task and attaching to it, but then when it tries to fetch the logs, it throws the following error:

    Traceback (most recent call last):
      File "/home/airflow/.local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1334, in _run_raw_task
        self._execute_task_with_callbacks(context)
      File "/home/airflow/.local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1460, in _execute_task_with_callbacks
        result = self._execute_task(context, self.task)
      File "/home/airflow/.local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1516, in _execute_task
        result = execute_callable(context=context)
      File "/home/airflow/.local/lib/python3.7/site-packages/airflow/utils/session.py", line 70, in wrapper
        return func(*args, session=session, **kwargs)
      File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/amazon/aws/operators/ecs.py", line 295, in execute
        self.task_log_fetcher = self._get_task_log_fetcher()
      File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/amazon/aws/operators/ecs.py", line 417, in _get_task_log_fetcher
        log_stream_name = f"{self.awslogs_stream_prefix}/{self.ecs_task_id}"
    AttributeError: 'EcsOperator' object has no attribute 'ecs_task_id'
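The failure mode in the traceback can be reproduced in isolation. The sketch below is a simplified stand-in, not the provider's code: the class name `SketchEcsOperator`, the bodies of its methods, and the sample ARNs are hypothetical. It assumes that `ecs_task_id` is assigned only when a new task is started, so the reattach path reaches the log-stream lookup without it.

```python
class SketchEcsOperator:
    """Hypothetical stand-in with a fresh-start path and a reattach path."""

    def __init__(self, awslogs_stream_prefix, reattach=False):
        self.awslogs_stream_prefix = awslogs_stream_prefix
        self.reattach = reattach
        self.arn = None
        # ecs_task_id is deliberately not initialised here (the assumed cause).

    def _start_task(self):
        # Fresh start: both the ARN and the task id are set.
        self.arn = "arn:aws:ecs:us-east-1:123456789012:task/demo-cluster/abc123"
        self.ecs_task_id = self.arn.split("/")[-1]

    def _try_reattach_task(self):
        # Reattach: only the ARN of the already-running task is recovered.
        self.arn = "arn:aws:ecs:us-east-1:123456789012:task/demo-cluster/abc123"

    def _get_task_log_fetcher(self):
        # Same expression as the failing line in the traceback.
        return f"{self.awslogs_stream_prefix}/{self.ecs_task_id}"

    def execute(self):
        if self.reattach:
            self._try_reattach_task()
        else:
            self._start_task()
        return self._get_task_log_fetcher()
```

Calling `SketchEcsOperator("ecs/demo").execute()` returns a stream name, while the same call with `reattach=True` raises `AttributeError` because the attribute was never set on that path.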

At this point, the operator fails, and the Airflow task is marked for retry and eventually marked as failed, while on the ECS side the ECS task keeps running fine. The manual workaround is to wait for the ECS task to complete, then mark the Airflow task as successful and trigger the downstream tasks. This is not very practical, since the task can take a long time (in our case, hours).

What you think should happen instead

I expect that the ECS operator should be able to reattach and pull the logs as normal.
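One way this expectation could be met, sketched under assumptions: derive the task id from the task ARN, so that it is available on the reattach path as well. The helper names `task_id_from_arn` and `log_stream_name` are hypothetical, the ARN in the usage note is made up, and the sketch assumes the task id is the last `/`-separated segment of the ARN. It is not necessarily the change made in the linked PR.

```python
def task_id_from_arn(task_arn: str) -> str:
    """Return the last '/'-separated segment of an ECS task ARN (assumed to be the task id)."""
    task_id = task_arn.rsplit("/", 1)[-1]
    # Reject ARNs with no '/' at all or with a trailing '/'.
    if not task_id or task_id == task_arn:
        raise ValueError(f"cannot derive a task id from ARN: {task_arn!r}")
    return task_id


def log_stream_name(awslogs_stream_prefix: str, task_arn: str) -> str:
    """Build the stream name the same way the failing line does, but from the ARN."""
    return f"{awslogs_stream_prefix}/{task_id_from_arn(task_arn)}"
```

For example, `log_stream_name("ecs/demo", "arn:aws:ecs:us-east-1:123456789012:task/demo-cluster/abc123")` evaluates to `"ecs/demo/abc123"`, whether the ARN came from a freshly started task or from a reattached one.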

How to reproduce

Configure a task that runs using the ECS operator, and make sure it takes a very long time. Start the task, and once the logs start flowing to Airflow, restart the Airflow scheduler. Wait for the scheduler to restart and check whether, upon retry, the task is able to reattach and fetch the logs.

Anything else

When Airflow restarts, it tries to kill the task at hand. In our case, we didn’t give the AWS role permission to kill running ECS tasks, and therefore the ECS tasks keep running during the restart of Airflow. Others might not have this setup, and therefore they won’t run into the “reattach” flow, and they won’t encounter the issue reported here. Letting Airflow kill the tasks is not a good option for us, since our tasks can take hours to complete, and we don’t want to interfere with their execution.

We also need to improve the stability of the OpenShift infrastructure where Airflow is running, so that the scheduler doesn’t restart so often, but that is a different story.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

fshehadeh commented, Apr 16, 2022

@o-nikolas: I tried to create a PR with the code change and possible unit tests at https://github.com/apache/airflow/pull/22879

o-nikolas commented, Apr 14, 2022

Thanks for flagging this one @potiuk

And thanks @fshehadeh for catching this issue. It is definitely a bug 😄

It seems like a quick fix (I’ve already coded it) but it will take some time to test it before pushing out a PR.
