-----===== Airflow ROADMAP =====-----
This GitHub issue is meant to track the big roadmap items ahead of us and to give the community a place to provide ideas, feedback, steering, and comments.
Lineage Annotations
Airflow knows a lot about your job dependencies, but how much does it know about your data objects (databases, tables, reports, …)? Lineage annotations would give users a way to define lineage between data objects explicitly and tie it to tasks and/or DAGs. The framework may include hooks for programmatic inference (the HiveOperator could introspect the HQL and guess table names and lineage) along with a way to override or complement these guesses. The framework will most likely be fairly agnostic about your data objects, letting you namespace them however you want and simply treating them as an array of parent/child relationships between strings. It may be nice to use dot notation and reserve the first part of the expression for the object_type, allowing for color coding in a graph view, tying in actions (links), and the like. This will of course ship with nice graph visualization, browsing and navigation features.
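To make the parent/child-strings idea concrete, here is a rough sketch; none of the names or structures below are an existing Airflow API, they only illustrate dot-notation namespacing with object_type as the first component.

```python
# Hypothetical sketch of the lineage model described above: data objects are
# namespaced strings (object_type first, dot-separated), and lineage is an
# array of (parent, child) pairs tied to a task. Nothing here is a real
# Airflow API; it only illustrates the shape of the data.
lineage_annotations = {
    # task_id -> list of (parent, child) edges between data objects
    "load_daily_events": [
        ("table.raw.events", "table.agg.daily_events"),
    ],
    "refresh_dashboard": [
        ("table.agg.daily_events", "report.events_dashboard"),
    ],
}

def object_type(name):
    """First dot-separated component, e.g. 'table' or 'report', which a
    graph view could use for color coding and for attaching actions/links."""
    return name.split(".", 1)[0]

def children(obj, annotations=lineage_annotations):
    """All data objects directly derived from obj."""
    return [child
            for edges in annotations.values()
            for parent, child in edges
            if parent == obj]

print(object_type("table.raw.events"))   # -> table
print(children("table.raw.events"))      # -> ['table.agg.daily_events']
```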
Picklin’
stateless/codeless web servers and workers
Backfill UI
Trigger a backfill from the UI
REST API
Essentially, offer features similar to what is available in the CLI through a REST API. There may be some automagic solutions here that can figure out how to build REST specs from argparse, which would also ensure consistency between the two.
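As a sketch of that automagic idea (the subcommands and the /api/v1 path scheme below are just assumptions, and the introspection leans on argparse internals), one could walk the CLI parser and emit a REST spec from it:

```python
# A rough sketch of deriving a REST spec from the same argparse definitions
# that drive the CLI, so the two can't drift apart. The subcommands shown and
# the /api/v1 path scheme are assumptions for illustration; the introspection
# relies on argparse internals (_actions).
import argparse

def build_cli_parser():
    parser = argparse.ArgumentParser(prog="airflow")
    sub = parser.add_subparsers(dest="command")

    backfill = sub.add_parser("backfill")
    backfill.add_argument("dag_id")
    backfill.add_argument("-s", "--start_date")
    backfill.add_argument("-e", "--end_date")

    pause = sub.add_parser("pause")
    pause.add_argument("dag_id")
    return parser

def rest_spec_from_parser(parser):
    """Turn each CLI subcommand into a (method, path, params) entry."""
    sub_action = next(a for a in parser._actions
                      if isinstance(a, argparse._SubParsersAction))
    spec = []
    for name, subparser in sub_action.choices.items():
        params = [a.dest for a in subparser._actions if a.dest != "help"]
        spec.append(("POST", "/api/v1/{}".format(name), params))
    return spec

if __name__ == "__main__":
    for method, path, params in rest_spec_from_parser(build_cli_parser()):
        print(method, path, params)
    # POST /api/v1/backfill ['dag_id', 'start_date', 'end_date']
    # POST /api/v1/pause ['dag_id']
```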
Continuous integration with Travis-CI [done!]
Systematically run all unit tests against CDH and HDP on Python 2.7 and 3.5
Externally Triggered DAGs
Airflow currently assumes that you run your workflows on a fixed schedule interval. This is perfect for hourly, daily and weekly jobs. When thinking about “analytics as a service” and “analysis automation”, many use cases are more of the “on demand” nature: for instance, a workflow that processes someone’s genome when ordered to, or a workflow that builds a dataset on demand for data scientists based on parameters they provide. This requires a new class of DAGs that are triggered externally rather than on a fixed schedule interval.
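As a minimal sketch of what such an on-demand DAG could look like (the parameters and their delivery via the run's conf are illustrative assumptions), note that the only real difference from a “classic” DAG is the absence of a schedule:

```python
# Sketch of an on-demand workflow: no schedule_interval, so nothing runs until
# an external trigger kicks it off with parameters. The "sample_id" parameter
# and the way it is passed in are illustrative assumptions.
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def process_genome(**context):
    # Parameters supplied by whoever triggered the run (assumed shape).
    dag_run = context.get("dag_run")
    conf = (dag_run.conf or {}) if dag_run else {}
    print("Processing genome for sample {}".format(conf.get("sample_id")))

dag = DAG(
    dag_id="genome_on_demand",
    start_date=datetime(2016, 1, 1),
    schedule_interval=None,  # no fixed schedule: runs only when triggered
)

process = PythonOperator(
    task_id="process",
    python_callable=process_genome,
    provide_context=True,
    dag=dag,
)
```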
Top GitHub Comments
A few “smaller” ideas that I’d love to push forward. I’ll add more as they occur to me.
Stateless DAGs
DAGs live as pickled objects in the database and are unpickled as needed. This would allow DAG .py scripts to live at any local or remote location, since they only need to be read once in order to become available to the entire Airflow ecosystem. If a user/worker/task notices or decides that a DAG pickle has expired, it can refresh the DAG for every consumer.
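As a toy illustration of that round trip (the table layout and the sqlite stand-in below are made up, and real DAG pickling has to deal with callables and imports that plain pickle may not handle gracefully):

```python
# Toy sketch of the idea above: serialize a DAG definition once, store the
# blob centrally, and let any webserver/worker rehydrate it on demand.
# The sqlite table here stands in for Airflow's metadata database.
import pickle
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE dag_pickle (dag_id TEXT PRIMARY KEY, blob BLOB)")

def store_dag(dag_id, dag_object):
    blob = pickle.dumps(dag_object, protocol=pickle.HIGHEST_PROTOCOL)
    db.execute("REPLACE INTO dag_pickle (dag_id, blob) VALUES (?, ?)",
               (dag_id, blob))

def load_dag(dag_id):
    row = db.execute("SELECT blob FROM dag_pickle WHERE dag_id = ?",
                     (dag_id,)).fetchone()
    return pickle.loads(row[0]) if row else None

# Any consumer can now refresh or fetch the DAG without reading the .py file.
store_dag("example_dag", {"tasks": ["extract", "load"]})  # dict as a stand-in
print(load_dag("example_dag"))
```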
Improved worker logs
We run airflow in a dockerized environment, where each worker is a separate microservice (and communicating across microservices is hard). As a result we have no way to access worker logs – the workers don’t communicate and even if they could, they could disappear/restart at any time. This isn’t my area of expertise, but perhaps workers could write log files to a remote filesystem (S3?) rather than keeping them locally. Then the webserver could load and display them appropriately.
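One possible shape for that, sketched below with a made-up bucket/key layout: a handler that buffers a task's log in memory and uploads it to S3 when the task finishes, so the webserver can fetch it from there instead of from the worker.

```python
# Sketch of shipping worker logs to a remote store instead of local disk.
# Bucket name and key layout are invented for illustration.
import io
import logging
import boto3

class S3TaskLogHandler(logging.Handler):
    def __init__(self, bucket, key):
        super(S3TaskLogHandler, self).__init__()
        self.bucket = bucket
        self.key = key
        self.buffer = io.StringIO()

    def emit(self, record):
        self.buffer.write(self.format(record) + "\n")

    def close(self):
        # Upload everything accumulated for this task in one shot, so the
        # webserver can later read it back from S3 instead of the worker.
        boto3.client("s3").put_object(
            Bucket=self.bucket,
            Key=self.key,
            Body=self.buffer.getvalue().encode("utf-8"),
        )
        super(S3TaskLogHandler, self).close()

# Usage idea: attach one handler per task instance, with a key such as
# "logs/<dag_id>/<task_id>/<execution_date>".
```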
With regard to Externally Triggered DAGs – here are some thoughts, some of which may make sense…
Today, DAGs are a hybrid of 1) task dependency graphs and 2) scheduler attributes (start_date, end_date, schedule_interval).
I suggest recasting a DAG as purely a dependency graph. It specifies the order in which tasks take place, but makes no claims about when they all start, stop, or repeat. It simply knows how to run its tasks in topological order, given an instruction to do so.
Those instructions could come from Scheduler objects which attach to DAGs and emit run times – maybe just once, maybe on a regular schedule, maybe irregularly, maybe totally at random. When the attached Scheduler tells the DAG to run, it runs.
In other words, every DAG is really an “Externally Triggered DAG” – it just happens that some of them are triggered on a regular schedule (and behave like DAGs do today). The Scheduler, instead of being a single master process that queries every DAG every 5 seconds to see if the DAG is supposed to run, becomes many smaller processes, one for each DAG, that actually kick off the DAG at the right time. And the master process is nothing more than a Clock which tells each of the Schedulers that it’s time to update…
Basically, I’m worried about having two classes of DAGs – externally triggered vs “classic” – when there really isn’t a fundamental difference between them. Airflow 1.x is obviously designed around periodic DAG runs, so an externally triggered DAG represents something that needs special handling. But if Airflow 2.x is built around arbitrary run times, then these two DAG types could be unified very nicely. And I like that because they both have the same core purpose: run tasks in order.
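A toy model of that split, with made-up class names, just to make the shape of the proposal concrete:

```python
# Toy model of the proposal above: the DAG is a pure dependency graph, and
# separate Scheduler objects decide when it runs; a Clock is the only master
# process. All class names here are hypothetical.
from datetime import datetime, timedelta

class DependencyGraphDAG(object):
    """Knows only its task order -- nothing about when to run."""
    def __init__(self, dag_id, tasks_in_order):
        self.dag_id = dag_id
        self.tasks_in_order = tasks_in_order  # assume already topologically sorted

    def run(self, execution_date, **conf):
        for task in self.tasks_in_order:
            print("{} @ {}: running {}".format(self.dag_id, execution_date, task))

class IntervalScheduler(object):
    """Emits run times on a fixed cadence -- behaves like today's DAGs."""
    def __init__(self, dag, interval):
        self.dag = dag
        self.interval = interval
        self.next_run = datetime.utcnow()

    def tick(self, now):
        # Called by the Clock; fires the DAG for every run time that has passed.
        while now >= self.next_run:
            self.dag.run(execution_date=self.next_run)
            self.next_run += self.interval

class ManualScheduler(object):
    """Runs the DAG only when explicitly told to -- an 'externally triggered' DAG."""
    def __init__(self, dag):
        self.dag = dag

    def trigger(self, **conf):
        self.dag.run(execution_date=datetime.utcnow(), **conf)

class Clock(object):
    """The only master process: it just tells every scheduler what time it is."""
    def __init__(self, schedulers):
        self.schedulers = schedulers

    def loop_once(self):
        now = datetime.utcnow()
        for scheduler in self.schedulers:
            if hasattr(scheduler, "tick"):
                scheduler.tick(now)

# Example: one scheduled DAG, one on-demand DAG, same core machinery.
hourly = IntervalScheduler(DependencyGraphDAG("etl", ["extract", "load"]),
                           timedelta(hours=1))
on_demand = ManualScheduler(DependencyGraphDAG("genome", ["align", "call"]))
Clock([hourly, on_demand]).loop_once()
on_demand.trigger(sample_id="abc123")
```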
Nice! +1 for REST APIs – it will be helpful to have read/query APIs to list DAGs/jobs and their status, etc.
Also, do you think it’s possible that different steps of a DAG may have different resource requirements – say, for example, memory? I am a beginner in Python, but I assume it would have a way of specifying such requirements, just like we can in Java (via -Xmx etc.). Further, for distributed resource schedulers (YARN/Mesos/Kubernetes/GCE etc.), even CPU cores and disk can be specified. I am afraid those specifications might not be applicable to local executors, though, so the behavior would differ between environments (rather undesirable).
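For what it's worth, here is a purely hypothetical sketch of how per-task resource hints could be declared and translated by a resource-manager-backed executor; nothing in it is a confirmed Airflow API.

```python
# Hypothetical per-task resource hints, as asked about above. A local
# executor would simply ignore these; a YARN/Mesos/Kubernetes-backed
# executor could translate them into container requests.
task_resources = {
    "light_bookkeeping": {"cpus": 1, "ram_mb": 512},
    "heavy_aggregation": {"cpus": 4, "ram_mb": 8192, "disk_mb": 20480},
}

def to_container_request(task_id):
    """How a distributed executor might translate a hint into a resource ask."""
    hint = task_resources.get(task_id, {})
    return {
        "cpus": hint.get("cpus", 1),
        "mem": hint.get("ram_mb", 1024),
        "disk": hint.get("disk_mb", 0),
    }

print(to_container_request("heavy_aggregation"))
# -> {'cpus': 4, 'mem': 8192, 'disk': 20480}
```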
One more thing: do you have plans for a UI for defining a DAG? Say, some sort of drag and drop that lets you select from available operators, set their properties, and use xcom within the DAG?