
Improve flow runner to enable more task parallelism

See original GitHub issue

Description

flatten has edge cases that sometimes cause tasks to run sequentially. I have identified two cases below that demonstrate the unusual behavior. In the test case, I flatten the output of a mapped seed task and execute three dependent sleep tasks that are intended to run in parallel. How the DAG is constructed determines whether the flow executes in parallel or not.

Expected Behavior

fast_flow demonstrates the expected behavior: upon launch, all three print statements show up immediately. slow_flow adds one extra step, storing each flattened map result in a list and retrieving it by index; this produces the same DAG as fast_flow, yet it executes sequentially. slow_flow_2 is identical to fast_flow except that it omits the downstream flatten and task2.map calls, and it also runs sequentially. It is odd that downstream tasks should be required for parallel execution.

(Screenshot of the flow run output attached in the original issue.)

Reproduction

import time
from prefect import task, Flow, flatten, unmapped
from prefect.engine.executors import LocalDaskExecutor


@task
def seed(i):
    return [i]


@task
def sleep(i, sleep_time):
    print(f'Sleep {sleep_time}')
    time.sleep(sleep_time)

    print(f'Finished {sleep_time}')

    return [i]


@task
def task2(i):
    print(f'Task2 {i}')
    print(f'Finished {i}')
    return [i]


with Flow("slow_flow") as slow_flow:
    start = ['A']
    inputs = flatten(seed.map(start))
    all_paths = []
    for i in (10, 11, 12):
        foo = [flatten(sleep.map(inputs, unmapped(i)))]
        all_paths.append(foo[0])
    sleep_out = flatten(all_paths)
    task2.map(sleep_out)

with Flow("fast_flow") as fast_flow:
    start = ['A']
    inputs = flatten(seed.map(start))

    all_paths = []
    for i in (10, 11, 12):
        all_paths.append(sleep.map(inputs, unmapped(i)))
    sleep_out = flatten(all_paths)
    task2.map(sleep_out)

with Flow("slow_flow_2") as slow_flow_2:
    start = ['A']
    inputs = flatten(seed.map(start))

    all_paths = []
    for i in (10,11,12):
        all_paths.append(sleep.map(inputs, unmapped(i)))


fast_flow.run(executor=LocalDaskExecutor(scheduler='threads'))
#slow_flow.run(executor=LocalDaskExecutor(scheduler='threads'))
#slow_flow_2.run(executor=LocalDaskExecutor(scheduler='threads'))
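A quick way to confirm whether the sleeps overlap (a hypothetical check, not part of the original report) is to time the run: if the three sleeps of 10, 11, and 12 seconds execute in parallel, the flow should finish in roughly 12 seconds, versus roughly 33 seconds when they run sequentially.

t0 = time.time()  # time is already imported above
fast_flow.run(executor=LocalDaskExecutor(scheduler='threads'))
print(f'Elapsed: {time.time() - t0:.1f}s')  # ~12s if parallel, ~33s if sequential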

Environment

Prefect 0.13.7, Python 3.8.2
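To verify the versions when reproducing (a trivial sketch, not part of the original report):

import sys
import prefect

print(prefect.__version__)  # expected: 0.13.7
print(sys.version)          # expected: 3.8.2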

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 5
  • Comments: 10

Top GitHub Comments

1 reaction
madkinsz commented, Oct 11, 2021

Please don’t ping our team directly. We triage handling of these issues internally.

As the warning indicates, flow.run() calls that use the DaskExecutor must be guarded by a __main__ block. This is required by the way scripts are packaged and sent to workers.

if __name__ == '__main__':
    flow.run(...)
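For context, here is a minimal sketch of a guarded flow (hypothetical flow and task names, using the prefect.engine.executors import path that matches the 0.13.x environment above):

from prefect import task, Flow
from prefect.engine.executors import DaskExecutor


@task
def say_hello():
    print('hello')


with Flow('guarded_flow') as flow:
    say_hello()

if __name__ == '__main__':
    # The guard keeps flow.run() from re-executing when Dask worker
    # processes import this module.
    flow.run(executor=DaskExecutor())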
0 reactions
github-actions[bot] commented, Dec 5, 2022

This issue was closed because it has been stale for 14 days with no activity. If this issue is important or you have more to add, feel free to re-open it.

Read more comments on GitHub.

Top Results From Across the Web

Parallelism within a Prefect flow
Prefect supports fully asynchronous / parallel running of a flow's tasks, and the preferred method for doing this is using Dask. By ...

Scaling Airflow to optimize performance - Astronomer Docs
Airflow has many parameters that impact its performance. Tuning these settings can impact DAG parsing and task scheduling performance, parallelism in your ...

Tuning parallelism: increase or decrease? - 158236
I am processing ~2 TB of HDFS data using DataFrames. The size of a task is equal to the block size specified ...

How to control the parallelism or concurrency of an Airflow ...
Here's an expanded list of configuration options that are available since Airflow v1.10.2. Some can be set on a per-DAG or per-operator ...

How we used parallel CI/CD jobs to increase our productivity
As we use Knapsack to distribute the test files among the parallel jobs, we were able to make more improvements by reducing the ...
