
Improve flow runner to enable more task parallelism

See original GitHub issue

Description

flatten has edge cases that sometimes cause tasks to run sequentially. I have identified two cases below that demonstrate the unusual behavior. In the test case, I flatten the output of a mapped seed task and execute three dependent sleep tasks that are intended to run in parallel. How the DAG is constructed determines whether the flow executes in parallel or not.

Expected Behavior

fast_flow demonstrates the expected behavior: upon launch, all three print statements show up immediately. slow_flow adds one extra step, storing each flattened map result in a list and retrieving it by index; this produces the same DAG as fast_flow, yet it executes sequentially. slow_flow_2 is identical to fast_flow except that it omits the downstream flatten and task2.map calls, and it also runs sequentially. It is odd that downstream tasks should be required for parallel execution.

(Screenshot of the flow run output attached in the original issue.)

Reproduction

import time
from prefect import task, Flow, flatten, unmapped
from prefect.engine.executors import LocalDaskExecutor


@task
def seed(i):
    return [i]


@task
def sleep(i, sleep_time):
    print(f'Sleep {sleep_time}')
    time.sleep(sleep_time)

    print(f'Finished {sleep_time}')

    return [i]


@task
def task2(i):
    print(f'Task2 {i}')
    print(f'Finished {i}')
    return [i]


with Flow("slow_flow") as slow_flow:
    start = ['A']
    inputs = flatten(seed.map(start))
    all_paths = []
    for i in (10, 11, 12):
        foo = [flatten(sleep.map(inputs, unmapped(i)))]
        all_paths.append(foo[0])
    sleep_out = flatten(all_paths)
    task2.map(sleep_out)

with Flow("fast_flow") as fast_flow:
    start = ['A']
    inputs = flatten(seed.map(start))

    all_paths = []
    for i in (10, 11, 12):
        all_paths.append(sleep.map(inputs, unmapped(i)))
    sleep_out = flatten(all_paths)
    task2.map(sleep_out)

with Flow("slow_flow_2") as slow_flow_2:
    start = ['A']
    inputs = flatten(seed.map(start))

    all_paths = []
    for i in (10,11,12):
        all_paths.append(sleep.map(inputs, unmapped(i)))


fast_flow.run(executor=LocalDaskExecutor(scheduler='threads'))
#slow_flow.run(executor=LocalDaskExecutor(scheduler='threads'))
#slow_flow_2.run(executor=LocalDaskExecutor(scheduler='threads'))
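A quick way to confirm whether the sleeps overlap (a hypothetical check, not part of the original report) is to time the run: if the three sleeps of 10, 11, and 12 seconds execute in parallel, the flow should finish in roughly 12 seconds, versus roughly 33 seconds when they run sequentially.

t0 = time.time()  # time is already imported above
fast_flow.run(executor=LocalDaskExecutor(scheduler='threads'))
print(f'Elapsed: {time.time() - t0:.1f}s')  # ~12s if parallel, ~33s if sequential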

Environment

Prefect 0.13.7, Python 3.8.2
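To verify the versions when reproducing (a trivial sketch, not part of the original report):

import sys
import prefect

print(prefect.__version__)  # expected: 0.13.7
print(sys.version)          # expected: 3.8.2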

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 5
  • Comments: 10

Top GitHub Comments

1 reaction
madkinsz commented, Oct 11, 2021

Please don’t ping our team directly. We triage handling of these issues internally.

As the warning indicates, flow.run() calls that use the DaskExecutor must be guarded by a __main__ block. This is required by the way scripts are packaged and sent to workers.

if __name__ == '__main__':
    flow.run(...)
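For context, here is a minimal sketch of a guarded flow (hypothetical flow and task names, using the prefect.engine.executors import path that matches the 0.13.x environment above):

from prefect import task, Flow
from prefect.engine.executors import DaskExecutor


@task
def say_hello():
    print('hello')


with Flow('guarded_flow') as flow:
    say_hello()

if __name__ == '__main__':
    # The guard keeps flow.run() from re-executing when Dask worker
    # processes import this module.
    flow.run(executor=DaskExecutor())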
0 reactions
github-actions[bot] commented, Dec 5, 2022

This issue was closed because it has been stale for 14 days with no activity. If this issue is important or you have more to add, feel free to re-open it.

Read more comments on GitHub.

Top Results From Across the Web

Parallelism within a Prefect flow
Prefect supports fully asynchronous / parallel running of a flow's tasks, and the preferred method for doing this is using Dask. By ...

Scaling Airflow to optimize performance - Astronomer Docs
Airflow has many parameters that impact its performance. Tuning these settings can impact DAG parsing and task scheduling performance, parallelism in your ...

Tuning parallelism: increase or decrease? - 158236
I am processing ~2 TB of HDFS data using DataFrames. The size of a task is equal to the block size specified ...

How to control the parallelism or concurrency of an Airflow ...
Here's an expanded list of configuration options that are available since Airflow v1.10.2. Some can be set on a per-DAG or per-operator ...

How we used parallel CI/CD jobs to increase our productivity
As we use Knapsack to distribute the test files among the parallel jobs, we were able to make more improvements by reducing the ...
