PIN 5: Combining Tasks

Date: 2019-02-20

Author: Chris White

Status

Proposed

Context

Imagine the following typical scenario: a data engineer wants to create a Prefect Flow which routinely migrates some data from S3 to Google Cloud Storage (along with other things). In our current framework, we implicitly recommend the user do something like (pseudo-code):

s3_task = S3Task(...)
gcs_task = GCSTask(...)

with Flow("migration") as f:
    gcs_task(data=s3_task)

This is OK, but imagine the S3 Task returns 10 GB of data, and the user routinely likes using “checkpointing”. In this case, the data coming out of S3 will hit the checkpoint, be shipped off somewhere else (dragging this Flow down), and then have to move around the Dask workers, resulting in large and unnecessary data movement. Moreover, many of these infrastructure / db clients have hooks for large data streams that we can’t take advantage of with this setup.

Another option is for the user to re-implement all the hooks / credentials / etc. for both GCS and S3, resulting in a monster S3toGCSTask. With this pattern, if we have M sources and N sinks, we need to maintain and test M*N different tasks (this is what frameworks like Airflow currently do). We want to avoid this situation. Ideally we should only have to maintain M + N tasks that can be flexibly and powerfully composed in various ways.
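The M + N versus M*N argument can be illustrated with a minimal sketch. Everything here is hypothetical: the sources and sinks are plain functions standing in for real tasks, and `combine` is an illustrative helper, not a Prefect API.

```python
from itertools import product
from typing import Callable, Dict

# Hypothetical stand-ins for source and sink tasks; real tasks would wrap
# S3, GCS, database clients, and so on.
sources: Dict[str, Callable[[], bytes]] = {
    "s3": lambda: b"data from s3",
    "postgres": lambda: b"data from postgres",
    "local": lambda: b"data from local disk",
}
sinks: Dict[str, Callable[[bytes], str]] = {
    "gcs": lambda data: f"gcs received {len(data)} bytes",
    "bigquery": lambda data: f"bigquery received {len(data)} bytes",
}


def combine(source: Callable[[], bytes], sink: Callable[[bytes], str]) -> Callable[[], str]:
    """Build one callable that runs the source and feeds its output to the sink."""
    def combined() -> str:
        return sink(source())
    return combined


# Only M + N definitions are maintained, yet every M * N pairing is available.
pipelines = {
    f"{src}_to_{dst}": combine(sources[src], sinks[dst])
    for src, dst in product(sources, sinks)
}

maintained = len(sources) + len(sinks)  # M + N = 5
available = len(pipelines)              # M * N = 6
```

With three sources and two sinks the gap is small, but it widens quickly: ten sources and ten sinks would mean 20 maintained definitions against 100 hand-written pairings.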

Additionally, it would be nice if users could specify that two tasks should run on the same worker, in the same process, and share memory.

Decision

We will implement some sugar which allows users to combine two tasks into a single Task. For example, the imperative version of this might look like (pseudo-code):

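from prefect import Task

class CombinedTask(Task):
    def __init__(self, first_task: Task, second_task: Task, **kwargs):
        super().__init__(**kwargs)
        self.first_task = first_task
        self.second_task = second_task

    def run(self):
        # Run both tasks in the same process so the data never leaves the worker.
        inputs = self.first_task.run()
        return self.second_task.run(inputs)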

along with a functional context manager:

with one_task():
    second_task(first_task)

Of course, there is some work that needs to be done under the hood to match inputs / outputs, and allow for calling patterns such as

with one_task():
    second_task(first_task(config="some_setting"), parameter="another_input")

But ultimately, these two tasks would be combined into a single task which is submitted to a single worker.
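As a rough illustration of the input / output matching described above, here is a self-contained sketch that does not depend on Prefect. The `Task` base class, `Download`, `Upload`, and the `first_kwargs` / `second_kwargs` parameters are all assumptions made up for this example; the PIN does not specify how arguments would be routed, and the context-manager form is not implemented here.

```python
from typing import Any, Dict, Optional


class Task:
    """Minimal stand-in for a task base class, used only for this sketch."""

    def run(self, *args: Any, **kwargs: Any) -> Any:
        raise NotImplementedError


class CombinedTask(Task):
    """Runs two tasks back to back in one process, passing data in memory."""

    def __init__(
        self,
        first_task: Task,
        second_task: Task,
        first_kwargs: Optional[Dict[str, Any]] = None,
        second_kwargs: Optional[Dict[str, Any]] = None,
    ) -> None:
        self.first_task = first_task
        self.second_task = second_task
        # Keyword arguments are stored per task so each one receives only its own.
        self.first_kwargs = dict(first_kwargs or {})
        self.second_kwargs = dict(second_kwargs or {})

    def run(self) -> Any:
        # The output of the first task becomes the first positional input of the second.
        intermediate = self.first_task.run(**self.first_kwargs)
        return self.second_task.run(intermediate, **self.second_kwargs)


class Download(Task):
    """Hypothetical source task."""

    def run(self, config: str = "default") -> str:
        return f"payload[{config}]"


class Upload(Task):
    """Hypothetical sink task."""

    def run(self, data: str, parameter: str = "") -> str:
        return f"uploaded {data} with {parameter}"


# Equivalent in spirit to: second_task(first_task(config=...), parameter=...)
combined = CombinedTask(
    Download(),
    Upload(),
    first_kwargs={"config": "some_setting"},
    second_kwargs={"parameter": "another_input"},
)
result = combined.run()
```

From the scheduler's point of view there is only one unit of work here (`combined`), which is what lets it be submitted to a single worker.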

How many tasks?

This PIN proposes we only support combining two tasks, with our target use case being migrating data. Allowing for arbitrary numbers might encourage an anti-pattern (Prefect generally prefers small, modular tasks), and become a headache to maintain (deciding which arguments to a combined task should actually be combined vs. left as standalone tasks will be tricky).

Consequences

The largest user-facing consequence is that a user who adopts this pattern loses any Prefect hooks that would otherwise run between the two tasks, such as trigger checks, notifications, and state handlers. In my view, this is perfectly OK in situations like this one, where the goal is simply to move data. If something fails, the data is still sitting in S3, and the user just needs the error to debug.

Exposing this pattern to users will certainly appease many of the data engineers we’ve talked to, as well as reduce the load on our system. Additionally, it would allow us to utilize a shared (temporary) filesystem for these connected / combined tasks and connect to different hooks that otherwise wouldn’t be available to us.
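One way a shared temporary filesystem could be used by a combined task is sketched below, using only the Python standard library. The `download_to` and `upload_from` functions are hypothetical placeholders for real client calls; the point is only that both steps see the same scratch directory because they run in the same process.

```python
import tempfile
from pathlib import Path


def download_to(path: Path) -> None:
    """Hypothetical source step: writes the payload to a local file."""
    path.write_bytes(b"x" * 1024)


def upload_from(path: Path, chunk_size: int = 256) -> int:
    """Hypothetical sink step: reads the file in chunks and reports bytes sent."""
    sent = 0
    with path.open("rb") as handle:
        while chunk := handle.read(chunk_size):
            sent += len(chunk)  # a real client would transmit the chunk here
    return sent


def run_combined() -> int:
    # Both steps share one scratch directory, which is removed when the block exits.
    with tempfile.TemporaryDirectory() as scratch:
        staged = Path(scratch) / "payload.bin"
        download_to(staged)
        return upload_from(staged)


bytes_sent = run_combined()
```

Because the payload is read back in fixed-size chunks, the sink never needs to hold the whole file in memory, which matters for the multi-gigabyte transfers mentioned in the Context section.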

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Reactions: 3
  • Comments: 19 (16 by maintainers)

Top GitHub Comments

1 reaction
joshmeek commented, Feb 21, 2019

@cicdw Okay sounds good. Yeah I would like to keep it at two tasks 😄

1 reaction
cicdw commented, Feb 21, 2019

@dylanbhughes good question.

No differently than other tasks; so using my running example, I would probably name this task s3toGCS so that name would appear in the UI, and I’d want to know my standard set of metrics:

  • duration
  • messages from failure
  • etc.

But nothing special otherwise. Combining two Prefect Tasks serves a large purpose of reducing the amount of boilerplate the user has to write, but in this exact instance the two separate processes are safe to be considered a single standalone unit, just like any other task.
