
Airflow ROADMAP

See original GitHub issue

This GitHub issue is meant to track the big roadmap items ahead of us, and to give the community a place to provide ideas, feedback, steering and comments.

Lineage Annotations

Airflow knows a lot about your job dependencies, but how much does it know about your data objects (databases, tables, reports, …)? Lineage annotations would expose a way for users to define lineage between data objects explicitly and tie it to tasks and/or DAGs. The framework may include hooks for programmatic inference (HiveOperator could introspect the HQL and guess table names and lineage) and a way to override or complement what is inferred. The framework will most likely be fairly agnostic about your data objects, letting you namespace them however you want, and simply treat lineage as an array of parent/child relationships between strings. It may be nice to use dot notation and reserve the first part of the expression for the object_type, allowing for color coding in a graph view and for attaching actions (links) and the like. This will of course ship with nice graph visualization, browsing and navigation features.
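As a thought experiment on the parent/child-strings model described above, here is a minimal sketch in plain Python; the names, the lineage pairs, and the idea of keying edges by task are illustrative assumptions, not a real or proposed Airflow API.

```python
# Hypothetical sketch: lineage as parent/child relationships between
# dot-namespaced strings, keyed by task. Purely illustrative.

lineage_edges = [
    # (parent, child) pairs; the first dot component encodes the object type
    ("table.raw_events", "table.daily_events_agg"),
    ("table.daily_events_agg", "report.weekly_engagement"),
]

def object_type(name: str) -> str:
    """Return the reserved first component, e.g. 'table' or 'report'."""
    return name.split(".", 1)[0]

# Explicit annotations per task; an operator such as HiveOperator could
# instead infer these by introspecting its HQL, with the explicit list
# overriding or complementing whatever was inferred.
task_lineage = {
    "aggregate_events": lineage_edges[:1],
    "build_weekly_report": lineage_edges[1:],
}

for task, edges in task_lineage.items():
    for parent, child in edges:
        print(f"{task}: {object_type(parent)} {parent} -> {object_type(child)} {child}")
```

Because the object types are recoverable from the strings themselves, a graph view can color-code nodes and attach type-specific links without the framework having to understand what a "table" or a "report" actually is.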

Picklin’

stateless/codeless web servers and workers

Backfill UI

Trigger a backfill from the UI

REST API

Essentially, offer features similar to what is available in the CLI through a REST API. There may be some automagic solutions here that can figure out how to build REST specs from argparse, which would also ensure consistency between the two.

[done!]
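To make the "REST specs from argparse" idea concrete, here is a rough sketch that walks a toy parser's subcommands and derives one route per command. The command names, arguments and route scheme are simplified assumptions, and peeking at argparse internals (`_actions`) is only for illustration; the real Airflow CLI is much richer.

```python
import argparse

# A tiny stand-in for a CLI parser (hypothetical commands, not the real CLI).
parser = argparse.ArgumentParser(prog="airflow")
subparsers = parser.add_subparsers(dest="command")

backfill = subparsers.add_parser("backfill")
backfill.add_argument("dag_id")
backfill.add_argument("-s", "--start-date")
backfill.add_argument("-e", "--end-date")

trigger = subparsers.add_parser("trigger_dag")
trigger.add_argument("dag_id")

def rest_spec(subparsers_action) -> dict:
    """Derive a naive REST spec: one POST route per subcommand, with that
    subcommand's argparse arguments as the request parameters."""
    spec = {}
    for name, sub in subparsers_action.choices.items():
        params = [a.dest for a in sub._actions if a.dest != "help"]
        spec[f"POST /api/v1/{name}"] = params
    return spec

print(rest_spec(subparsers))
# {'POST /api/v1/backfill': ['dag_id', 'start_date', 'end_date'],
#  'POST /api/v1/trigger_dag': ['dag_id']}
```

Generating the spec from the same parser the CLI uses is what would keep the two surfaces consistent by construction.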

Continuous integration with Travis-CI

Systematically run all unit tests against CDH and HDP on Python 2.7 and 3.5

Externally Triggered DAGs

Airflow currently assumes that you run your workflows on a fixed schedule interval. This is perfect for hourly, daily and weekly jobs. When thinking about “analytics as a service” and “analysis automation”, many use cases are more of the “on demand” nature. For instance if you have a workflow that processes someone’s genome when ordered to, or a workflow that builds a dataset on demand for data scientists based on parameters they provide, … This requires a new class of DAGs that are triggered externally, not on a fixed schedule interval.
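For context, a DAG that only runs when triggered might look like the sketch below. The import paths and CLI command follow later Airflow 2.x conventions (none of this existed when the roadmap was written), and the genome-processing parameters are made up for illustration.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def process_genome(**context):
    # On-demand parameters arrive with the triggering request,
    # via the DAG run's conf dictionary.
    conf = context["dag_run"].conf or {}
    print(f"processing sample {conf.get('sample_id')}")

with DAG(
    dag_id="process_genome_on_demand",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,  # no fixed interval: runs only when triggered
    catchup=False,
) as dag:
    PythonOperator(task_id="process", python_callable=process_genome)

# Triggered externally, e.g.:
#   airflow dags trigger process_genome_on_demand --conf '{"sample_id": "NA12878"}'
```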

Issue Analytics

  • State: closed
  • Created 8 years ago
  • Reactions: 1
  • Comments: 29 (18 by maintainers)

Top GitHub Comments

1 reaction
jlowin commented, Sep 30, 2015

A few “smaller” ideas that I’d love to push forward. I’ll add more as they occur to me.

  1. Stateless DAGs

    DAGs live as pickled objects in the database and are unpickled as needed. This would allow DAG .py scripts to live at any local or remote location, since they only need to be read once in order to become available to the entire Airflow ecosystem. If a user/worker/task notices or decides that a DAG pickle has expired, it can refresh the DAG for every consumer (a rough sketch of this idea follows after this list).

  2. Improved worker logs

    We run airflow in a dockerized environment, where each worker is a separate microservice (and communicating across microservices is hard). As a result we have no way to access worker logs – the workers don’t communicate and even if they could, they could disappear/restart at any time. This isn’t my area of expertise, but perhaps workers could write logfiles to a remote filesystem (S3?) rather than keeping them locally. Then the webserver could load and display them appropriately.
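Picking up the stateless-DAGs idea from point 1: a minimal sketch, assuming an in-memory dict as a stand-in for the metadata database's pickle store and a toy class in place of a real Airflow DAG.

```python
import pickle
from datetime import datetime, timedelta

# Stand-in for the metadata database's pickle table.
PICKLE_STORE: dict[str, dict] = {}

class ToyDag:
    """Plain object standing in for an Airflow DAG definition."""
    def __init__(self, dag_id, tasks):
        self.dag_id = dag_id
        self.tasks = tasks  # e.g. {"extract": [], "load": ["extract"]}

def publish(dag: ToyDag, ttl: timedelta = timedelta(minutes=30)) -> None:
    """Pickle a DAG read from its .py file and store it for all consumers."""
    PICKLE_STORE[dag.dag_id] = {
        "blob": pickle.dumps(dag),
        "expires": datetime.utcnow() + ttl,
    }

def fetch(dag_id: str) -> ToyDag:
    """Any worker/webserver can rebuild the DAG without the .py file.
    If the pickle has expired, a consumer would re-read the source and
    call publish() again, refreshing it for everyone."""
    entry = PICKLE_STORE[dag_id]
    if datetime.utcnow() >= entry["expires"]:
        raise LookupError(f"pickle for {dag_id} expired; re-publish it")
    return pickle.loads(entry["blob"])

publish(ToyDag("example_etl", {"extract": [], "load": ["extract"]}))
print(fetch("example_etl").tasks)
```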

With regard to ExternallyTriggeredDags – here are some thoughts, some of which may make sense…

Today, DAGs are a hybrid of 1) task dependency graphs and 2) scheduler attributes (start_date, end_date, schedule_interval).

I suggest recasting a DAG as purely a dependency graph. It specifies the order in which tasks take place, but makes no claims about when they all start, stop, or repeat. It simply knows how to run its tasks in topological order, given an instruction to do so.

Those instructions could come from Scheduler objects which attach to DAGs and emit run times – maybe just once, maybe on a regular schedule, maybe irregularly, maybe totally at random. When the attached Scheduler tells the DAG to run, it runs.

In other words, every DAG is really an “Externally Triggered DAG” – it just happens that some of them are triggered on a regular schedule (and behave like DAGs do today). The Scheduler, instead of being a single master process that queries every DAG every 5 seconds to see if it is supposed to run, becomes many smaller processes, one per DAG, each of which actually kicks off its DAG at the right time. And the master process is nothing more than a Clock which tells each of the Schedulers that it’s time to update…

Basically, I’m worried about having two classes of DAGs – externally triggered vs “classic” – when there really isn’t a fundamental difference between them. Airflow 1.x is obviously designed around periodic DAG runs, so an externally triggered DAG represents something that needs special handling. But if Airflow 2.x is built around arbitrary run times, then these two DAG types could be unified very nicely. And I like that because they both have the same core purpose: run tasks in order.
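To make the DAG-plus-Scheduler split concrete, here is a toy sketch; every class name is hypothetical (this is not Airflow's API). It only illustrates the idea that a "classic" interval schedule and an external trigger can feed the same dependency-graph-only DAG.

```python
from datetime import datetime, timedelta
from graphlib import TopologicalSorter  # Python 3.9+

class ToyDag:
    """Purely a dependency graph: knows how to run tasks in order."""
    def __init__(self, dag_id, deps):
        self.dag_id = dag_id
        self.deps = deps  # {"task": {"upstream", ...}}

    def run(self, run_time):
        for task in TopologicalSorter(self.deps).static_order():
            print(f"{self.dag_id} @ {run_time:%Y-%m-%d %H:%M}: running {task}")

class IntervalSchedule:
    """Emits run times on a regular cadence (the 'classic' DAG)."""
    def __init__(self, start, every):
        self.next_run, self.every = start, every

    def due(self, now):
        if now >= self.next_run:
            run_time, self.next_run = self.next_run, self.next_run + self.every
            return run_time
        return None

class ExternalTrigger:
    """Emits a run time only when someone asks for one."""
    def __init__(self):
        self.pending = []

    def trigger(self, when=None):
        self.pending.append(when or datetime.utcnow())

    def due(self, now):
        return self.pending.pop(0) if self.pending else None

def tick(pairs, now):
    """The 'Clock' master process: ask each schedule if its DAG is due."""
    for schedule, dag in pairs:
        run_time = schedule.due(now)
        if run_time is not None:
            dag.run(run_time)

etl = ToyDag("daily_etl", {"extract": set(), "load": {"extract"}})
adhoc = ToyDag("genome_job", {"align": set(), "call_variants": {"align"}})

external = ExternalTrigger()
external.trigger()  # someone orders a run

tick(
    [(IntervalSchedule(datetime(2023, 1, 1), timedelta(days=1)), etl),
     (external, adhoc)],
    now=datetime(2023, 1, 2),
)
```

The point of the toy is that the interval case and the on-demand case differ only in which schedule object is attached, which is exactly the unification argued for above.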

1 reaction
kapil-malik commented, Sep 18, 2015

Nice! +1 for REST APIs – it will be helpful to have read/query APIs to list DAGs/jobs and their status, etc.

Also, do you think it’s possible that different steps of a DAG may have different resource requirements – say, for example, memory? I am a beginner in Python, but I assume it has a way to specify such requirements, just like we can do in Java (via -Xmx etc.). Further, for distributed resource schedulers (YARN / Mesos / Kubernetes / GCE etc.), even CPU cores / disk can be specified. I am afraid, though, that those specifications might not be applicable to local executors, so the behavior would differ across environments (rather undesirable).
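For what it's worth, per-task resource hints could be modelled as plain data attached to each task. The sketch below is purely hypothetical (none of these names are Airflow's API) and mainly illustrates the local-vs-distributed-executor concern raised above.

```python
from dataclasses import dataclass

@dataclass
class Resources:
    """Hypothetical per-task resource hints (not Airflow's actual API)."""
    cpus: int = 1
    ram_mb: int = 512
    disk_mb: int = 0

# Each step of a DAG declares what it needs; a distributed executor
# (YARN / Mesos / Kubernetes) could translate these into container requests,
# while a local executor would simply ignore them -- which is exactly the
# behavioural difference across environments noted above.
TASK_RESOURCES = {
    "extract": Resources(cpus=1, ram_mb=512),
    "train_model": Resources(cpus=8, ram_mb=16384, disk_mb=20480),
}

def to_k8s_request(res: Resources) -> dict:
    """Example translation to a Kubernetes-style resource request."""
    return {"cpu": str(res.cpus), "memory": f"{res.ram_mb}Mi"}

print(to_k8s_request(TASK_RESOURCES["train_model"]))
```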

One more thing: do you have plans for a UI for defining a DAG? Say, some sort of drag-and-drop that lets you select from the available operators, set their properties, and use XCom within the DAG?

