[Content Gap] Depths first execution priority on sub dags

Use Case

We have a pipeline that processes data using the map method and looks roughly like this:

  1. check_jobs_to_process
  2. load_raw_data (much data)
  3. preprocess_raw_data (heavily reduces data)
  4. do_fancy_stuff
  5. push_results

Solids 2-5 are mapped over the jobs returned by step 1, and their outputs are never collected at the end. The problem is that the current execution order first runs step 2 for all jobs and only then runs steps 3, 4 and 5 for each job. This means that once solid 2 has finished for all jobs, the amount of data held in memory is very high, and we currently have no way to add an IO manager. Since solid 3 heavily reduces the data, the period during which each job actually needs that much memory is quite short.
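
The memory effect described above can be illustrated with a small simulation. This is only a sketch in plain Python, not Dagster code: the job names, the memory figures in `HELD_AFTER`, and the helper functions are all made-up assumptions chosen to show the shape of the problem.

```python
# Steps 2-5 of the pipeline, which run once per mapped job.
STEPS = ["load_raw_data", "preprocess_raw_data", "do_fancy_stuff", "push_results"]

# Assumed memory (arbitrary units) that one job holds after each step finishes.
# The numbers are invented: loading is large, preprocessing shrinks it a lot.
HELD_AFTER = {
    "load_raw_data": 100,
    "preprocess_raw_data": 5,
    "do_fancy_stuff": 5,
    "push_results": 0,
}


def current_order(jobs):
    """Order described in the issue: step 2 for all jobs, then 3-5 per job."""
    schedule = [(job, STEPS[0]) for job in jobs]
    schedule += [(job, step) for job in jobs for step in STEPS[1:]]
    return schedule


def depth_first_order(jobs):
    """Requested order: run steps 2-5 for one job before starting the next."""
    return [(job, step) for job in jobs for step in STEPS]


def peak_memory(schedule):
    """Replay a schedule sequentially and return the highest total held memory."""
    held = {}
    peak = 0
    for job, step in schedule:
        held[job] = HELD_AFTER[step]
        peak = max(peak, sum(held.values()))
    return peak


jobs = ["job_a", "job_b", "job_c", "job_d"]
print("current order peak:", peak_memory(current_order(jobs)))
print("depth-first peak:  ", peak_memory(depth_first_order(jobs)))
```

Under these assumed numbers the current order peaks at the combined size of all raw inputs, while the depth-first order never holds more than one job's raw data at a time.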

Ideas of Implementation

Add a configuration option, or perhaps a parameter to the map method, that executes each sub-DAG depth-first instead of breadth-first across all mapped jobs.
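
One way such an option could work inside an executor is to change how the next step is picked from the set of ready steps. The sketch below is a toy sequential scheduler written for illustration; `run_order` and its `strategy` parameter are hypothetical names and do not correspond to any existing Dagster API.

```python
import heapq


def run_order(num_jobs, num_steps, strategy="breadth_first"):
    """Return the (job, step) execution order of a toy sequential scheduler.

    Each mapped job is a linear chain of `num_steps` steps. A step becomes
    ready once the previous step of the same job has run. The hypothetical
    `strategy` option only changes which ready step is picked next.
    """
    if strategy not in ("breadth_first", "depth_first"):
        raise ValueError(f"unknown strategy: {strategy}")

    def priority(step):
        # heapq pops the smallest value first, so negate the step index
        # to prefer the deepest ready step under depth-first execution.
        return -step if strategy == "depth_first" else step

    # Initially the first step of every job is ready.
    ready = [(priority(0), job, 0) for job in range(num_jobs)]
    heapq.heapify(ready)

    order = []
    while ready:
        _, job, step = heapq.heappop(ready)
        order.append((job, step))
        if step + 1 < num_steps:
            # Finishing a step makes the next step of the same job ready.
            heapq.heappush(ready, (priority(step + 1), job, step + 1))
    return order


print(run_order(2, 3, strategy="breadth_first"))
print(run_order(2, 3, strategy="depth_first"))
```

The only difference between the two strategies is the priority function, which suggests the feature could be exposed as a single setting rather than a separate executor. A real executor would also have to account for concurrency limits and retries, which this sketch ignores.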


Message from the maintainers:

Excited about this feature? Give it a 👍. We factor engagement into prioritization.

Issue Analytics

  • State: open
  • Created 2 years ago
  • Reactions: 6
  • Comments: 10 (5 by maintainers)

Top GitHub Comments

1 reaction
alangenfeld commented, Jun 8, 2021

"express my concern about how limiting it seems to not be able to start transforming parts of the data until all the data has been extracted"

Makes sense, this is a much bigger change for the system so will track in a separate issue.

0 reactions
AntonFriberg commented, Jun 8, 2021

Thanks! Utilizing an IO manager would work as long as the data fits on the configured disk.

I am not aware of how DynamicOutputs build the execution graph, but I need to express my concern about how limiting it seems to not be able to start transforming parts of the data until all the data has been extracted.
