Customizing logic using Python

Currently, there is no officially supported way to customize the behavior of ploomber build, but in some cases a user may want to override the default rules (i.e., execute a task when its code changes). This is already possible by loading the pipeline.yaml into Python and then modifying the DAG object. This issue discusses some use cases, as well as what we need to do to officially support this.

Please comment if these examples solve your problem or if you have any other use cases we should consider.

We’ve got a few example use cases:

Skip a branch based on an input parameter

Say the pipeline looks like this:

graph LR;
    A-->B;
    A-->C;
    B-->D;
    C-->E;

In some cases, we may want to skip an entire branch (e.g. B -> D) based on an input parameter.

Working example

# download example
ploomber examples -n cookbook/python-load --branch custom-logic -o python-load
cd python-load

What’s missing?

While this is technically possible by deleting tasks one by one (e.g. del dag['B']), there isn’t a simple way to delete an entire branch, so we should add a few handy methods (e.g., dag.delete_branch('B'))
Add a cookbook example
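
A minimal sketch of what such a helper would need to compute, using a plain dictionary instead of Ploomber's DAG object. The function name branch_tasks and the downstream mapping are hypothetical and only illustrate the traversal; with a real DAG, each returned name could then be removed one by one with del dag[name], as described above.

```python
from collections import deque


def branch_tasks(downstream, root):
    """Return root plus every task reachable from it (hypothetical helper)."""
    seen = {root}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for child in downstream.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Same shape as the diagram above: A-->B, A-->C, B-->D, C-->E
downstream = {"A": ["B", "C"], "B": ["D"], "C": ["E"]}

to_delete = branch_tasks(downstream, "B")
print(sorted(to_delete))  # ['B', 'D']
```

This sketch does not handle a downstream task that also depends on a task outside the branch; a real implementation would need to decide whether such a task should be kept.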

Customize caching logic

With long-running tasks, users may want to skip execution even if the code has changed slightly, or apply their own custom rules. This usually happens with data ingestion tasks. It is possible to achieve this today, but only by using private APIs.

Working example

# download example
ploomber examples -n cookbook/python-load --branch custom-logic -o python-load
cd python-load

What’s missing?

We are lacking documentation on TaskStatus, and currently the only way to achieve this is via the private dag._params.cache_rendered_status attribute. We should make this a public API.
Add a cookbook example
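
As an illustration of what a custom rule could look like, here is a minimal sketch of an age-based check for a data ingestion product. The function name is_fresh and the one-day threshold are assumptions made for this example, not part of Ploomber. How the resulting decision is applied to the DAG depends on the private API mentioned above and is not shown here.

```python
import os
import time


def is_fresh(product_path, max_age_seconds, now=None):
    """Return True if the product exists and was modified recently enough.

    Hypothetical rule: a fresh product means the task could be skipped,
    even if its source code changed.
    """
    if not os.path.exists(product_path):
        return False
    now = time.time() if now is None else now
    return (now - os.path.getmtime(product_path)) <= max_age_seconds


# Example: treat an ingested file as fresh for one day (assumed threshold)
ONE_DAY = 24 * 60 * 60
print(is_fresh("does-not-exist.csv", ONE_DAY))  # False, the file is missing
```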

TO DO:

Add tutorial showing how to use Ploomber’s CLI to call a factory entry point where the pipeline loads from a pipeline.yaml and env.yaml (show that cli args and cell injection work)
Add a link to the tutorial from the point above to the cookbook that shows how to load the dag using Python

Issue Analytics

State:
Created a year ago
Comments:13 (5 by maintainers)

Top GitHub Comments

edublancas commented, Apr 1, 2022

Yes, as the name suggests, --force should run everything regardless of anything. The case where we want to skip something “artificially” (that is, modifying Ploomber’s standard behavior) is highly dependent on the use case. I think it’s best to keep the current behavior (--force executes everything) and have users determine how they want to customize if needed - in 90% of the cases, the default behavior works.

For example, someone may decide to apply some very custom rules, and they can turn them on/off by adding an argument to the custom entry point:

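from ploomber import with_env
from ploomber.spec import DAGSpec


@with_env('env.yaml')
def make(env, custom_flag=False):
    # load the pipeline declared in pipeline.yaml, passing the env values
    dag = DAGSpec('pipeline.yaml', env=dict(env)).to_dag()
    # use custom_flag to manually determine task status (override default behavior)
    return dag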

Then custom_flag becomes accessible from the CLI: ploomber build -e pipeline.make --custom-flag

The benefit of using the Python API is that you can define your logic: as you mention @mitch-at-orika, you can even write some logic that takes a JSON file and determines that task’s status based on that.
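
A minimal sketch of the JSON idea, assuming a hypothetical file format that maps task names to either "skip" or "run". The function name load_skip_set and the file layout are illustrative assumptions, and the sketch only reads the file; it does not touch Ploomber's API.

```python
import json


def load_skip_set(path):
    """Return the names of tasks marked "skip" in a JSON file.

    Assumed (hypothetical) format: {"task_name": "skip" | "run", ...}
    """
    with open(path) as f:
        overrides = json.load(f)
    return {name for name, status in overrides.items() if status == "skip"}
```

A custom entry point could call a function like this when the custom flag is set and then decide, task by task, whether to override the default status.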

Thanks for the feedback! I see a lot of value in explaining to users how to customize the DAG execution logic, so I’ll add a tutorial showing all the things we’ve discussed here.

edublancas commented, Mar 28, 2022

awesome, thanks for the feedback! I’ll work on adding a bit more details to the examples I provided, merge them to the master branch, and open some issues to tackle the use of private APIs.