[Content Gap] Depth-first execution priority on sub-DAGs
Use Case
We have a pipeline that processes data using the map
method, and it looks roughly like this:
1. check_jobs_to_process
2. load_raw_data (much data)
3. preprocess_raw_data (heavily reduces data)
4. do_fancy_stuff
5. push_results
So solids 2-5 are mapped over the jobs returned by solid 1, and their outputs are never collected at the end.
The actual issue is that the current execution order first executes step 2 for all jobs, and only then runs steps 3, 4, and 5 for each job. This means that at the moment solid 2 has finished for all jobs, the amount of data in memory is pretty high. We currently have no way to add an io_management.
Since the data gets heavily reduced in solid 3, the period during which memory usage is high is actually pretty short.
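To make the memory argument concrete, here is a minimal sketch in plain Python. It does not use Dagster. It only replays two execution orders and tracks how much step output is held at once. The sizes (`RAW_SIZE`, `REDUCED_SIZE`), the job names, and the rule that a step's output replaces its job's previous output are all assumptions made for illustration.

```python
# Hypothetical output sizes per job, in arbitrary memory units.
RAW_SIZE = 100     # output of load_raw_data (much data)
REDUCED_SIZE = 5   # output after preprocess_raw_data has reduced it

STEPS = ["load_raw_data", "preprocess_raw_data", "do_fancy_stuff", "push_results"]
OUTPUT_SIZE = {
    "load_raw_data": RAW_SIZE,
    "preprocess_raw_data": REDUCED_SIZE,
    "do_fancy_stuff": REDUCED_SIZE,
    "push_results": 0,  # nothing is kept once results are pushed
}


def peak_memory(order):
    """Replay (job, step) pairs and return the peak total of held outputs.

    Simplifying assumption: a step's output replaces the previous output
    of the same job, i.e. the input is freed once it has been consumed.
    """
    held = {}  # job -> size of the output currently held for that job
    peak = 0
    for job, step in order:
        held[job] = OUTPUT_SIZE[step]
        peak = max(peak, sum(held.values()))
    return peak


jobs = ["job_0", "job_1", "job_2", "job_3"]

# Order described in the issue: step 2 for all jobs, then 3, 4, 5 per job.
current_order = [(job, STEPS[0]) for job in jobs] + [
    (job, step) for job in jobs for step in STEPS[1:]
]

# Requested order: finish the whole sub-DAG of one job before the next.
depth_first_order = [(job, step) for job in jobs for step in STEPS]

print(peak_memory(current_order))      # 400
print(peak_memory(depth_first_order))  # 100
```

Under these assumptions the peak grows with the number of jobs in the current order, while in the depth-first order it stays at the size of a single raw load.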
Ideas of Implementation
Add a configuration option, or maybe a parameter to the map
method, to execute each sub-DAG depth first instead of breadth first across all mapped jobs.
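As a rough illustration of what such a switch could mean for a scheduler, the sketch below models a single worker that always picks the next step from a priority queue of ready steps. This is a toy model, not Dagster's executor, and the `schedule` function and its `depth_first` parameter are hypothetical names chosen for this example.

```python
import heapq

STEPS = ["load_raw_data", "preprocess_raw_data", "do_fancy_stuff", "push_results"]


def schedule(jobs, depth_first=False):
    """Return the order in which one worker would run (job, step) pairs."""

    def key(step_index, job_index):
        # Depth first: prefer the ready step that is furthest along its chain.
        # Breadth first: prefer the earliest step across all jobs.
        return (-step_index if depth_first else step_index, job_index)

    # Initially only the first mapped step of every job is ready.
    ready = [(key(0, j), 0, j) for j in range(len(jobs))]
    heapq.heapify(ready)

    order = []
    while ready:
        _, step_index, job_index = heapq.heappop(ready)
        order.append((jobs[job_index], STEPS[step_index]))
        # Finishing a step makes the next step of the same job ready.
        next_index = step_index + 1
        if next_index < len(STEPS):
            heapq.heappush(ready, (key(next_index, job_index), next_index, job_index))
    return order


for job, step in schedule(["job_0", "job_1"], depth_first=True):
    print(job, step)
```

With `depth_first=True` all four steps of `job_0` run before `job_1` starts. With the default, every job completes a step before any job moves on to the next one. The only difference between the two modes is the sort key of the ready queue, which is why a single parameter or configuration value could plausibly select the behavior.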
Message from the maintainers:
Excited about this feature? Give it a 👍. We factor engagement into prioritization.
Issue Analytics
- State:
- Created 2 years ago
- Reactions: 6
- Comments: 10 (5 by maintainers)
Top GitHub Comments
Makes sense, this is a much bigger change for the system so will track in a separate issue.
Thanks! Utilizing an IO manager would work as long as the data fits in the configured disk.
I am not aware of how DynamicOutputs build the execution graph, but I need to express my concern about how limiting it seems that we cannot start transforming parts of the data until all the data has been extracted.