Tracking changes in external files

Incremental builds allow users to quickly iterate since Ploomber takes care of only executing tasks whose source code or parameters have changed since the last run. However, if the source code is loading external files, changes to them are not detected:

from pathlib import Path

def my_task(product, upstream):
    # changes to some/path.json are not detected!
    content = Path('some/path.json').read_text()
    # do stuff
    # ...
    Path(product).write_text(output)

We’re looking for a way to enhance this functionality in the simplest possible way for the user. The cleanest approach we’ve found so far is to embed this logic in tasks[*].params:

# pipeline.yaml
tasks:
    - source: my_module.my_task
      product: output.txt
      params:
        json_file: some/path.json

Then our code would look like this:

from pathlib import Path

def my_task(product, upstream, json_file):
    # json_file comes from tasks[*].params in pipeline.yaml
    content = Path(json_file).read_text()
    # do stuff
    # ...
    Path(product).write_text(output)

The main benefit is that this is task-agnostic: it works the same whether it’s a function, script, or notebook.

However, for a given task, not all parameters are paths to files. Users may not want to trigger task execution on changes to all external files, so they need a way to distinguish between params and files that trigger task execution.

Option 1: Naming convention

One way to achieve this without any API changes is a naming convention. For example, a suffix (e.g., resource) could tell Ploomber to also track the file’s contents:

# pipeline.yaml
tasks:
    - source: my_module.my_task
      product: output.txt
      params:
        # track contents of this file
        json_file: some/path-resource.json
        # do not track this
        json_file_another: another/path.json
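
As a rough illustration, the check behind such a convention might look like the sketch below. It assumes the suffix is "-resource" at the end of the file name (before the extension), as in the example above. is_tracked is a hypothetical helper, not part of Ploomber's API.

```python
from pathlib import Path


def is_tracked(path, suffix='-resource'):
    """Return True if the file name (without extension) ends with the suffix.

    Hypothetical helper: the '-resource' suffix is an assumed convention.
    """
    return Path(path).stem.endswith(suffix)


# the two params from the pipeline.yaml example
print(is_tracked('some/path-resource.json'))
print(is_tracked('another/path.json'))
```

The downside of this approach is that users must rename files to opt in to tracking.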

Option 2: Special type of parameters

Alternatively, we may define a special type of param:

# pipeline.yaml
tasks:
    - source: my_module.my_task
      product: output.txt
      params:
        # track contents of this file
        json_file:
            # special type of parameter defined by the resource key
            resource: some/path.json
        # do not track this
        json_file_another: another/path.json
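
A minimal sketch of how the parsed params could be separated under this option, assuming a tracked param is a mapping whose only key is resource. split_params is a hypothetical name, not an existing Ploomber function.

```python
def split_params(params):
    """Separate regular params from {'resource': path} params.

    Hypothetical helper: assumes a tracked param is a dict whose only
    key is 'resource'.
    """
    regular, resources = {}, {}
    for name, value in params.items():
        if isinstance(value, dict) and set(value) == {'resource'}:
            resources[name] = value['resource']
        else:
            regular[name] = value
    return regular, resources


# params as they would look after parsing the pipeline.yaml above
params = {
    'json_file': {'resource': 'some/path.json'},
    'json_file_another': 'another/path.json',
}
regular, resources = split_params(params)
```

Here resources holds the paths whose contents should be tracked, and regular holds everything else.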

Important considerations

  1. File size: To keep the metadata file size small, we should only save the file hash, not its contents.
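
One way to compute such a hash is sketched below. SHA-256 and the chunk size are assumptions (the issue does not name an algorithm), and file_digest is a hypothetical helper.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file.

    Reads the file in chunks so large files are never loaded into
    memory at once. Hypothetical helper, not part of Ploomber.
    """
    digest = hashlib.sha256()
    with Path(path).open('rb') as f:
        # iter() with a sentinel stops when read() returns b'' (end of file)
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()
```

The digest is a fixed-length string (64 hex characters) regardless of file size, so storing it keeps the metadata small.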

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 7 (5 by maintainers)

Top GitHub Comments

edublancas commented, Jul 23, 2021

If you’re interested in the issue about detecting changes in imported modules, check out #111.

edublancas commented, Jul 24, 2021

While working on the implementation, I thought of an alternative: having a sub-section under params.

# pipeline.yaml
tasks:
    - source: my_module.my_task
      product: output.txt
      params:
        # track everything here....
        resources:
          json_file: some/path.json
        # regular, untracked params...
        another_param: 1

from pathlib import Path

def my_task(product, upstream, resources):
    content = Path(resources['json_file']).read_text()
    # do stuff
    # ...
    Path(product).write_text(output)
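
To make the idea concrete, here is a rough sketch of how the resources sub-section could drive the outdated check: hash every file listed there and compare against the hashes saved on the previous run. hash_resources, resources_changed and the stored_hashes format are all hypothetical, and SHA-256 is an assumed choice.

```python
import hashlib
from pathlib import Path


def hash_resources(resources):
    """Map each resource name to the SHA-256 digest of its file contents."""
    return {
        name: hashlib.sha256(Path(path).read_bytes()).hexdigest()
        for name, path in resources.items()
    }


def resources_changed(params, stored_hashes):
    """Return True if any file under params['resources'] differs from
    the hashes saved on the previous run.

    Regular params (e.g. another_param) are ignored here; they would be
    compared by value as usual.
    """
    current = hash_resources(params.get('resources', {}))
    return current != stored_hashes
```

With this layout, a task would be re-executed if resources_changed returns True, in addition to the existing source code and parameter checks.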

What do you think? @filipj8 @fferegrino
