Customizing logic using Python
Currently, there is no way to customize `ploomber build` behavior, but in some cases, a user may want to override the default rules (i.e., execute a task when the code changes). This is possible by loading the `pipeline.yaml` into Python and then modifying the `DAG` object. This issue discusses some use cases as well as what we need to do to officially support this.
Please comment if these examples solve your problem or if you have any other use cases we should also consider.
We’ve got a few example use cases:
Skip a branch based on an input parameter
Say the pipeline looks like this:
graph LR;
A-->B;
A-->C;
B-->D;
C-->E;
In some cases, we may want to skip an entire branch (e.g. B -> D) based on an input parameter.
# download example
ploomber examples -n cookbook/python-load --branch custom-logic -o python-load
cd python-load
What’s missing?
- While this is technically possible by deleting tasks one by one (e.g. `del dag['B']`), there isn’t a simple way to delete an entire branch, so we should add a few handy methods (e.g., `dag.delete_branch('B')`)
- Add a cookbook example
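To make the proposal concrete, here is a minimal sketch of what a branch-deletion helper could do. `delete_branch` does not exist in Ploomber today; the function names, the `skip_b` parameter, and the plain dictionaries standing in for the DAG are all assumptions for illustration. The only deletion the issue confirms is the per-task `del dag['B']`, which the helper mimics with `del tasks[name]`.

```python
from collections import deque


def branch_tasks(downstream, root):
    """Return `root` plus every task reachable from it (breadth-first)."""
    seen = {root}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for child in downstream.get(current, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def delete_branch(tasks, downstream, root):
    """Delete `root` and all of its descendants from `tasks`, in place."""
    to_delete = branch_tasks(downstream, root)
    for name in to_delete:
        del tasks[name]  # stands in for `del dag[name]`
    return to_delete


# The example pipeline: A->B, A->C, B->D, C->E
downstream = {"A": ["B", "C"], "B": ["D"], "C": ["E"]}
tasks = {name: f"task {name}" for name in "ABCDE"}  # stand-in for the DAG

skip_b = True  # hypothetical input parameter
removed = delete_branch(tasks, downstream, "B") if skip_b else set()

print(sorted(removed))  # ['B', 'D']
print(sorted(tasks))    # ['A', 'C', 'E']
```

Collecting descendants first and deleting afterwards keeps the traversal independent of the deletion, so the same `branch_tasks` result could also be used to skip tasks instead of removing them.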
Customize caching logic
With long-running tasks, users may want to skip execution even if the code has changed slightly, or even apply custom rules; this usually happens with data ingestion tasks. It’s possible to achieve that using private APIs:
# download example
ploomber examples -n cookbook/python-load --branch custom-logic -o python-load
cd python-load
What’s missing
- We are lacking documentation on `TaskStatus`, and currently the only way to achieve this is via the private `dag._params.cache_rendered_status`; we should make this a public API.
- Add a cookbook example
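As an illustration of a custom rule, the sketch below decides which tasks to skip from a JSON document, independently of Ploomber. The JSON format (task name mapped to `"skip"` or `"run"`), the function name, and the task names are hypothetical. It only computes the decision; applying it to a real DAG currently goes through the private `dag._params.cache_rendered_status` and `TaskStatus` mentioned above, which is deliberately not shown since that API is undocumented.

```python
import json


def tasks_to_skip(rules_json, changed_tasks):
    """Return the names of changed tasks whose rule says "skip".

    rules_json: JSON text mapping task name -> "skip" or "run" (hypothetical format).
    changed_tasks: names of tasks the default rules would re-execute.
    Tasks without a rule fall back to "run", i.e. the default behavior.
    """
    rules = json.loads(rules_json)
    return {name for name in changed_tasks if rules.get(name, "run") == "skip"}


rules_json = '{"ingest": "skip", "clean": "run"}'
changed = ["ingest", "clean", "plot"]

print(sorted(tasks_to_skip(rules_json, changed)))  # ['ingest']
```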
TO DO:
- Add a tutorial showing how to use Ploomber’s CLI to call a factory entry point where the pipeline loads from a `pipeline.yaml` and `env.yaml` (show that CLI args and cell injection work)
- Add a link from that tutorial to the cookbook that shows how to load the DAG using Python
Issue Analytics
- Created a year ago
- Comments: 13 (5 by maintainers)
Yes, as the name suggests, `--force` should run everything regardless of anything. The case where we want to skip something “artificially” (that is, modifying Ploomber’s standard behavior) is highly dependent on the use case. I think it’s best to keep the current behavior (`--force` executes everything) and have users determine how they want to customize if needed; in 90% of the cases, the default behavior works.
For example, someone may decide to apply some very custom rules, and they can turn them on/off by adding an argument to the custom entry point:
Then `custom_stuff` becomes accessible in the CLI:
ploomber build -e pipeline.make --custom-flag
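A minimal sketch of what such an entry point might look like, assuming a `pipeline.py` module that exposes a `make` function (both names inferred from the command above). The `custom_flag` parameter, the `apply_custom_rules` helper, and the rule it applies are hypothetical, and the way a boolean parameter maps to `--custom-flag` is inferred from that command rather than from documented behavior.

```python
def apply_custom_rules(dag, custom_flag):
    """Drop the B -> D branch when the flag is on (illustrative rule only)."""
    if custom_flag:
        # Downstream task first, then its upstream; per-task deletion
        # is the approach described earlier in this issue.
        for name in ("D", "B"):
            del dag[name]
    return dag


def make(custom_flag: bool = False):
    """Factory entry point, assumed to live in pipeline.py."""
    # Imported lazily so apply_custom_rules can be tried without Ploomber
    from ploomber.spec import DAGSpec

    dag = DAGSpec("pipeline.yaml").to_dag()
    return apply_custom_rules(dag, custom_flag)
```

Keeping the rule in its own function means it can be exercised against any mapping that supports `del`, without building the real pipeline.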
The benefit of using the Python API is that you can define your logic: as you mention @mitch-at-orika, you can even write some logic that takes a JSON file and determines that task’s status based on that.
Thanks for the feedback! I see a lot of value in explaining to users how to customize the DAG execution logic, so I’ll add a tutorial showing all the things we’ve discussed here.
Awesome, thanks for the feedback! I’ll work on adding a bit more detail to the examples I provided, merge them to the master branch, and open some issues to tackle the use of private APIs.