First class Sub-flow concept
Archived from the Prefect Public Slack Community
walter_gillett: Hi - we are building bioinformatics pipelines related to infectious disease. Prefect looks interesting. I am wondering about task grouping (a.k.a. nesting or sub-dags). Each step in our pipeline reads inputs from GCS and writes outputs to GCS. Without task grouping, this will get messy. For example, suppose we have steps 1, 2, and 3, each of which reads one GCS input and writes a GCS output. That yields 9 tasks (3 GCS download, 3 compute, and 3 upload), but we would like to group them into pipeline steps because that’s the essential unit of work. Is there a way to model this in Prefect?
chris: Hi <@UQM4X5RE2>! Apologies if I’m misunderstanding the use case, but it sounds like you only need 3 Prefect Tasks? What is the benefit you hope to achieve by “grouping” tasks without them being realized as true Prefect Tasks?
walter_gillett: Hi <@ULRBLQ19A> - likely I am misunderstanding how Prefect works. Yes, I want only 3 Prefect Tasks. But if I want to use Prefect machinery to conveniently download from GCS, that’s a task (prefect.tasks.google.storage.GCSDownload), same for upload, so I get 9 Tasks, yes? Conceptually there are 3 pipeline steps so I would like the workflow structure to reflect that. I am thinking of this as being like SubDAGs in Airflow (https://www.astronomer.io/guides/subdags/), where aggregating low-level details makes it possible to have a workflow with a higher level of granularity.
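The grouping described here, where each pipeline step bundles a download, a compute, and an upload into one unit of work, could be sketched roughly as below. This is plain Python, not Prefect code: `BUCKET` is an in-memory stand-in for GCS, and `download`, `upload`, and `make_step` are hypothetical helpers invented for illustration.

```python
from typing import Callable, Dict, List

# Assumption: an in-memory dict stands in for a GCS bucket.
# Real code would call a storage client here instead.
BUCKET: Dict[str, str] = {"raw.txt": "acgt"}


def download(path: str) -> str:
    """Read one object from the fake bucket."""
    return BUCKET[path]


def upload(path: str, data: str) -> None:
    """Write one object to the fake bucket."""
    BUCKET[path] = data


def make_step(compute: Callable[[str], str], src: str, dst: str) -> Callable[[], str]:
    """Bundle download -> compute -> upload into a single callable step."""
    def step() -> str:
        upload(dst, compute(download(src)))
        return dst
    return step


# Three pipeline steps, each hiding its own download and upload.
steps: List[Callable[[], str]] = [
    make_step(str.upper, "raw.txt", "step1.txt"),
    make_step(lambda s: s[::-1], "step1.txt", "step2.txt"),
    make_step(lambda s: s + s, "step2.txt", "step3.txt"),
]

# Run the steps in order; each one reads the previous step's output.
outputs = [step() for step in steps]
```

The trade-off is that the download and upload are no longer individually visible to a scheduler, so retries and state tracking would apply to the whole step rather than to each of the nine operations.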
walter_gillett: I see related discussion here: https://docs.prefect.io/core/PINs/PIN-05-Combining-Tasks.html and https://github.com/PrefectHQ/prefect/issues/980 . But not sure what the recommendation coming out of that is.
chris: Yea, I think I understand better what you’re referring to now - thanks for that link; correct me if I’m wrong here, but the airflow notion of SubDAG is an API convenience in the UI for seeing task groupings, which makes sense. I don’t think I see any functional difference in the way the DAG behaves between the fully expanded representation and the SubDAG representation.
In Prefect, you can certainly create multiple flows and then link them together using some combination of flow.update / flow.set_dependencies / flow.root_tasks() / flow.terminal_tasks(), but ultimately we haven’t yet exposed an analogous first-class “sub Flow” concept
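To make the idea of rolling flows together concrete, here is a toy model of it. It does not import Prefect: `MiniFlow` and `chain` are hypothetical, and the method names only mirror the ones mentioned above for readability, so the real Prefect signatures may differ. The sketch merges two small DAGs and then wires the second flow's root tasks to the first flow's terminal tasks.

```python
from typing import Dict, Iterable, Set


class MiniFlow:
    """Toy DAG (not Prefect): maps each task name to its upstream task names."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.upstream: Dict[str, Set[str]] = {}

    def set_dependencies(self, task: str, upstream_tasks: Iterable[str] = ()) -> None:
        """Register a task and the tasks it depends on."""
        ups = set(upstream_tasks)
        self.upstream.setdefault(task, set()).update(ups)
        for up in ups:
            self.upstream.setdefault(up, set())

    def update(self, other: "MiniFlow") -> None:
        """Copy every task and edge from another flow into this one."""
        for task, ups in other.upstream.items():
            self.set_dependencies(task, ups)

    def root_tasks(self) -> Set[str]:
        """Tasks with no upstream dependencies."""
        return {t for t, ups in self.upstream.items() if not ups}

    def terminal_tasks(self) -> Set[str]:
        """Tasks that nothing else depends on."""
        has_downstream = set().union(*self.upstream.values())
        return set(self.upstream) - has_downstream


def chain(first: MiniFlow, second: MiniFlow, name: str) -> MiniFlow:
    """Merge two flows so that `second` starts after `first` finishes."""
    combined = MiniFlow(name)
    combined.update(first)
    combined.update(second)
    for root in second.root_tasks():
        combined.set_dependencies(root, first.terminal_tasks())
    return combined


step1 = MiniFlow("step1")
step1.set_dependencies("compute_1", ["download_1"])
step1.set_dependencies("upload_1", ["compute_1"])

step2 = MiniFlow("step2")
step2.set_dependencies("compute_2", ["download_2"])
step2.set_dependencies("upload_2", ["compute_2"])

pipeline = chain(step1, step2, "pipeline")
```

Note that the combined flow is fully expanded: once merged, nothing records that `download_1`, `compute_1`, and `upload_1` belonged to one step, which is the gap a first-class sub-flow concept would fill.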
walter_gillett: Thanks <@ULRBLQ19A> good to know, rolling up flows could be the answer for now. Adding a first-class subflow concept to Prefect would be helpful, but nesting adds complexity so would have to be done carefully - more is not always better. As a side note re Airflow SubDAGs from the article I linked to “Astronomer highly recommends staying away from SubDags. Airflow 1.10 has changed the default SubDag execution method to use the Sequential Executor to work around deadlocks caused by SubDags”.
chris: very interesting; yea I agree this seems like a really convenient abstraction - we’ll definitely look into it! I’ll actually use our bot to archive this thread as a GitHub issue that we can use to track it
chris: <@ULVA73B9P> archive “First class Sub-flow concept”
What about a decorator concept like this?
Thanks for the response @lauralorenz and for sharing the miro board! Task design vs flow design pros/cons on page 3 is really interesting. I’d like to learn more about why some of these pros/cons are there. To address your question, the design I have in mind is a kind of mix where the flow object is callable like a task. Something like
Where ab_flow(a, b) can also be called in another flow. I think this design would make it easier to modularize flows and keep tasks tiny. I feel like it will also be more explicit what code is run when this way vs calling a task from another task. I’m not sure what the pros/cons are with this approach or whether it’s technically feasible. What do you think?