Airflow .airflowignore not handling soft link properly.

Apache Airflow version

2.3.0 (latest released)

What happened

A soft link and a folder under the same root folder are handled as the same relative path. Say I have a dags folder that looks like this:

-dags:
  -- .airflowignore
  -- folder
  -- soft-links-to-folder -> folder

and .airflowignore:

folder/

Both folder and soft-links-to-folder will be ignored.

What you think should happen instead

Only the folder should be ignored. This was the behavior in Airflow 2.2.4, before I upgraded. The root cause is that both _RegexpIgnoreRule and _GlobIgnoreRule call the relative_to method to get the search path.
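To illustrate the root cause described here, this is a minimal sketch using only the standard library, not Airflow's own code. It assumes the relative path is computed after resolving symlinks; the helper name relative_search_paths is hypothetical.

```python
import os
import tempfile
from pathlib import Path
from typing import Tuple


def relative_search_paths(root: Path, entry: Path) -> Tuple[str, str]:
    """Hypothetical helper: return (resolved, unresolved) paths relative to root."""
    # Resolving first follows the symlink, so the link's own name is lost.
    resolved = entry.resolve().relative_to(root.resolve())
    # Without resolving, the link keeps its own name.
    unresolved = entry.relative_to(root)
    return str(resolved), str(unresolved)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "folder").mkdir()
    os.symlink(root / "folder", root / "soft-links-to-folder")
    resolved, unresolved = relative_search_paths(root, root / "soft-links-to-folder")
    print(resolved)    # relative path seen when the symlink is resolved first
    print(unresolved)  # relative path seen when the symlink is left as-is
```

If the resolved form is what gets matched against the ignore rules, the soft link becomes indistinguishable from the folder it points to.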

How to reproduce

See @tirkarthi's comment for the test case.

Operating System

ubuntu

Versions of Apache Airflow Providers

No response

Deployment

Official Apache Airflow Helm Chart

Deployment details

No response

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 10 (9 by maintainers)

Top GitHub Comments

1 reaction
tirkarthi commented, May 6, 2022

Here is a sample test case for the report. I guess this is because resolve is called before match, which resolves the symlink to the original folder here. As per the report, “soft-links-to-folder” resolves to “folder” and gets ignored.

https://github.com/apache/airflow/blob/6cc41abf6912fd2705b9ef7cf368c888c43c8af8/airflow/utils/file.py#L68

The test case passes before the changes in https://github.com/apache/airflow/pull/22051 (cc: @ianbuss).

# tests/plugins/test_plugin_ignore.py
import os
import shutil
import tempfile

from airflow.utils.file import find_path_from_directory


# Method of the test class in this file; assumes self.test_dir already exists.
def test_symlink_not_ignored(self):
    # Start from a fresh temporary directory.
    shutil.rmtree(self.test_dir)
    self.test_dir = tempfile.mkdtemp(prefix="onotole")
    source = os.path.join(self.test_dir, "folder")
    target = os.path.join(self.test_dir, "symlink")
    py_file = os.path.join(source, "hello_world.py")
    ignore_file = os.path.join(self.test_dir, ".airflowignore")

    # Create the real folder and a symlink that points to it.
    os.mkdir(source)
    os.symlink(source, target)

    # Ignore only the real folder.
    with open(ignore_file, 'w') as f:
        f.write("folder")

    with open(py_file, 'w') as f:
        f.write("print('hello world')")

    ignore_list_file = ".airflowignore"
    found = list(find_path_from_directory(self.test_dir, ignore_list_file))

    # The file reached through the symlink must still be discovered.
    assert os.path.join(self.test_dir, "symlink", "hello_world.py") in found
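
For readers without an Airflow checkout, the same check can be sketched with the standard library alone. This is an illustration under stated assumptions, not Airflow's implementation: find_paths and build_layout are hypothetical helpers, patterns are treated as regular expressions searched against the relative POSIX path, and the resolve_first flag only switches between the two behaviours discussed in this thread.

```python
import os
import re
import tempfile
from pathlib import Path
from typing import List


def find_paths(base: str, patterns: List[str], resolve_first: bool) -> List[str]:
    """Hypothetical ignore-aware walker; a stand-in, not Airflow's code."""
    compiled = [re.compile(p) for p in patterns]
    base_path = Path(base)
    found = []
    # followlinks=True so that files behind the symlink are visited at all.
    for dirpath, _dirnames, filenames in os.walk(base, followlinks=True):
        for name in filenames:
            full = Path(dirpath) / name
            if resolve_first:
                # Symlink is resolved, so the path is reported under "folder".
                rel = full.resolve().relative_to(base_path.resolve())
            else:
                # Symlink name is preserved in the relative path.
                rel = full.relative_to(base_path)
            if not any(rx.search(rel.as_posix()) for rx in compiled):
                found.append(str(full))
    return found


def build_layout(base: str) -> str:
    """Create folder/, symlink -> folder, and folder/hello_world.py."""
    source = os.path.join(base, "folder")
    os.mkdir(source)
    os.symlink(source, os.path.join(base, "symlink"))
    with open(os.path.join(source, "hello_world.py"), "w") as f:
        f.write("print('hello world')")
    # Path of the file as reached through the symlink.
    return os.path.join(base, "symlink", "hello_world.py")
```

Calling find_paths on a directory prepared by build_layout with the pattern "folder" lets the two modes be compared side by side: resolve_first=True mirrors the reported regression, and resolve_first=False mirrors the behaviour the test expects.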
0 reactions
ianbuss commented, May 6, 2022

I have prepped a simple initial PR that should hopefully restore the original behaviour (it includes the test case provided by @tirkarthi, thanks!), but it would be good to get some additional eyes on it. If we want to make larger changes to the symlink handling, should that perhaps be a future PR with further thought? It depends on the timeline of 2.3.1, I think.
