Tracking changes in external files
Incremental builds let users iterate quickly because Ploomber only executes tasks whose source code or parameters have changed since the last run. However, if the source code loads external files, changes to those files are not detected:
from pathlib import Path


def my_task(product, upstream):
    # changes to some/path.json are not detected!
    content = Path('some/path.json').read_text()
    # do stuff
    # ...
    output = content  # placeholder for the processed result
    Path(product).write_text(output)
We’re looking to enhance this functionality in the simplest possible way for the user. The cleanest approach we’ve found so far is to embed this logic in tasks[*].params:
# pipeline.yaml
tasks:
  - source: my_module.my_task
    product: output.txt
    params:
      json_file: some/path.json
Then our code would look like this:
from pathlib import Path


def my_task(product, upstream, json_file):
    content = Path(json_file).read_text()
    # do stuff
    # ...
    output = content  # placeholder for the processed result
    Path(product).write_text(output)
The main benefit is that this is task-agnostic: it works the same whether it’s a function, script, or notebook.
However, not all of a task's parameters are paths to files, and users may not want every external file to trigger task execution. They need a way to distinguish ordinary params from file params whose changes should trigger execution.
Option 1: Naming convention
One way to achieve this without any API changes is a naming convention. For example, add a suffix (e.g., resource) to tell Ploomber to also track the file's contents:
# pipeline.yaml
tasks:
  - source: my_module.my_task
    product: output.txt
    params:
      # track contents of this file
      json_file: some/path-resource.json
      # do not track this
      json_file_another: another/path.json
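To make the convention concrete, here is a minimal sketch of how tracked files could be separated from ordinary params. It assumes the suffix is `-resource` on the file name (as in the example above); `RESOURCE_SUFFIX` and `split_params` are hypothetical names, not part of Ploomber's API.

```python
from pathlib import Path

# Assumption: a file whose stem ends in "-resource" should be tracked.
RESOURCE_SUFFIX = "-resource"


def split_params(params):
    """Split params into (tracked, plain) using the naming convention."""
    tracked, plain = {}, {}
    for name, value in params.items():
        # Only string values can be file paths; check the name without extension.
        is_resource = isinstance(value, str) and Path(value).stem.endswith(
            RESOURCE_SUFFIX
        )
        (tracked if is_resource else plain)[name] = value
    return tracked, plain


params = {
    "json_file": "some/path-resource.json",
    "json_file_another": "another/path.json",
}
tracked, plain = split_params(params)
```

The trade-off is that the convention is implicit: renaming a file silently changes whether it is tracked.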
Option 2: Special type of parameters
Alternatively, we may define a special type of param:
# pipeline.yaml
tasks:
  - source: my_module.my_task
    product: output.txt
    params:
      # track contents of this file
      json_file:
        # special type of parameter defined by the resource key
        resource: some/path.json
      # do not track this
      json_file_another: another/path.json
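As a rough sketch of how this option might be processed, the function below unwraps any param declared as a mapping with a single `resource` key and records its name for tracking. `resolve_params` is a hypothetical helper written for illustration, not Ploomber's implementation.

```python
def resolve_params(params):
    """Return (resolved, tracked).

    resolved: the params as the task would receive them (plain values).
    tracked: names of params whose file contents should be tracked.
    """
    resolved, tracked = {}, []
    for name, value in params.items():
        # Assumption: a special param is a dict with exactly one key, "resource".
        if isinstance(value, dict) and set(value) == {"resource"}:
            resolved[name] = value["resource"]
            tracked.append(name)
        else:
            resolved[name] = value
    return resolved, tracked


params = {
    "json_file": {"resource": "some/path.json"},
    "json_file_another": "another/path.json",
}
resolved, tracked = resolve_params(params)
```

With this shape the task function still receives `json_file` as a plain path string, so the task code shown earlier would not need to change.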
Important considerations
- File size: To keep the metadata file size small, we should only save the file hash, not its contents
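The file-size consideration could be handled along these lines: store a fixed-length digest and compare it on the next run. This is a sketch under assumptions; SHA-256 is chosen here for illustration, and `file_hash` and `is_outdated` are hypothetical names rather than existing Ploomber functions.

```python
import hashlib


def file_hash(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large files are never fully loaded in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_outdated(path, stored_hash):
    """True if the file's current hash differs from the one saved in metadata."""
    return file_hash(path) != stored_hash
```

The digest is 64 hex characters regardless of the file's size, so the metadata file stays small even when the tracked files are large.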
If you’re interested in detecting changes to imported modules, check out #111.
While working on the implementation, I thought of an alternative: having a sub-section under params.
What do you think? @filipj8 @fferegrino