
Make FileChangedChecker and Dependency classes more flexible to allow for cloud based file dependencies

See original GitHub issue

Description

Enable doit to track file dependencies for remote file systems such as AWS S3.

doit tracks file dependencies via file_dep in the task dictionary. File dependencies are checked via the dependency.FileChangedChecker class, which indicates whether a file is up to date. The FileChangedChecker class is of the form

class FileChangedChecker:
    def check_modified(self, file_path, file_stat, state):
        ...
        return True  # if file is modified
    def get_state(self, dep, current_state):
        state = ...
        return state

doit provides two builtin implementations, MD5FileChangedChecker and TimestampFileChangedChecker, which respectively use the md5 hash and the timestamp to check if a file has changed.
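As a point of reference, the built-in checkers can also be selected by name in dodo.py. A minimal sketch (to my understanding doit accepts the strings 'md5' and 'timestamp' for this option, with md5 being the default):

```python
# Sketch: picking a built-in up-to-date checker by name in dodo.py.
# 'timestamp' compares mtimes; 'md5' (the default) hashes file contents.
DOIT_CONFIG = {'check_file_uptodate': 'timestamp'}
```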

The class is customizable: the user can provide a CustomFileChangedChecker class, as documented at https://pydoit.org/cmd_run.html#custom-check-file-uptodate:

DOIT_CONFIG = {'check_file_uptodate': CustomFileChangedChecker}

This should provide a route to handling remote filesystems such as AWS S3. However, in doit’s latest version (v0.33), the file_stat argument that is passed to FileChangedChecker.check_modified is computed by os.stat in dependency.Dependency. In addition, dependency.Dependency uses os.path.exists to check whether a target exists. The dependency.Dependency class isn’t itself customizable, so AFAICT both calls need to be updated to enable handling remote storage/filesystems such as AWS S3 via a custom FileChangedChecker.
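The limitation is easy to see: the stdlib calls only understand local paths, so to them an S3 URI simply looks like a missing file. A minimal illustration (the bucket and key names are made up):

```python
import os

# os.path.exists only knows the local filesystem, so any S3 URI is "missing"
local_view = os.path.exists('s3://some-bucket/key.txt')  # False

# os.stat fails for the same reason: there is no such local path
try:
    os.stat('s3://some-bucket/key.txt')
    stat_works = True
except OSError:
    stat_works = False  # os.stat cannot see S3 objects
```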

Proposed implementation

Add two methods to FileChangedChecker, exists and info, that can be called by dependency.Dependency instead of the currently hard-coded os.path.exists and os.stat.

class FileChangedChecker:
    def exists(self, file_path):
        """Return True if file_path exists, False otherwise."""
        return os.path.exists(file_path)   # default implementation

    def info(self, file_path):
        """Return some metadata about the file at file_path."""
        return os.stat(file_path)          # default implementation

    def check_modified(self, file_path, file_stat, state):
        ...
        return True  # if file is modified
    def get_state(self, dep, current_state):
        state = ...
        return state
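On the doit side, dependency.Dependency would then delegate to the checker instead of calling the os functions directly. A hypothetical sketch of that dispatch (the Dependency methods below are illustrative stand-ins, not doit's actual internals):

```python
import os


class FileChangedChecker:
    """Default checker with local-filesystem semantics."""

    def exists(self, file_path):
        return os.path.exists(file_path)

    def info(self, file_path):
        return os.stat(file_path)


class Dependency:
    """Illustrative stand-in for doit.dependency.Dependency."""

    def __init__(self, checker=None):
        self.checker = checker or FileChangedChecker()

    def target_exists(self, path):
        # previously hard-coded as os.path.exists(path)
        return self.checker.exists(path)

    def file_info(self, path):
        # previously hard-coded as os.stat(path)
        return self.checker.info(path)


class FakeS3Checker(FileChangedChecker):
    """Toy override: pretend every s3:// key exists."""

    def exists(self, file_path):
        return file_path.startswith('s3://')


dep = Dependency(FakeS3Checker())
remote_ok = dep.target_exists('s3://bucket/key')  # True with the fake checker
```

The point is only that, once Dependency routes these two calls through the checker, a subclass can swap in remote-storage semantics without any other change.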

See PR https://github.com/pydoit/doit/pull/407

Example dodo.py

"""
Example dodo.py with S3 file_dep.
"""
from doit.tools import Interactive
from doit.dependency import FileChangedChecker
import s3fs  # TODO: should just use boto3 here to keep things simple
import os


class S3FileChangedChecker(FileChangedChecker):
    """Check if S3 File is up to date

    Assumes that FileChangedChecker has two additional methods
    `exists` and `info` that are called within
    `doit.dependency.Dependency` instead of `os.path.exists` and
    `os.stat`.
    See https://github.com/pydoit/doit/pull/407
    """
    CheckerError = FileNotFoundError

    def exists(self, file_path):
        # target might be local so handle both cases
        if file_path.startswith('s3://'):
            fs = s3fs.S3FileSystem()
            return fs.exists(file_path)
        else:
            return os.path.exists(file_path)

    def info(self, file_path):
        # for now this assumes the file is in s3
        fs = s3fs.S3FileSystem()
        raw_info = fs.info(file_path)
        # make sure the result is JSON serializable
        return {key: str(value) for (key, value) in raw_info.items()}

    def check_modified(self, file_path, file_stat, state):
        """Check if file in file_path is modified from previous "state".

        file_path (string): file path
        file_stat: result of ``info()`` for file_path (here an S3 info
            dict rather than an os.stat result)
        state: state that was previously saved with ``get_state()``

        returns (bool): True if dep is modified
        """
        if file_stat['ETag'] != state['ETag']:
            return True

        return False

    def get_state(self, dep, current_state):
        """Compute the state of a task after it has been successfully executed.

        dep (str): path of the dependency file.
        current_state (tuple): the current state, saved from a previous
            execution of the task (None if the task was never run).

        returns (dict|None): the new state. Return None if the state is unchanged.
           state is of the form
           {
            'ETag': '"ed076287532e86365e841e92bfc50d8c"',
            'Key': 'ssr-scratch/hello.txt',
            'LastModified': datetime.datetime(2021, 11, 27, 17, 54, 45, tzinfo=tzutc()),
            'Size': 12,
            'size': 12,
            'name': 'ssr-scratch/hello.txt',
            'type': 'file',
            'StorageClass': 'STANDARD',
            'VersionId': None
           }

        The parameter `current_state` is passed to allow speed optimization,
        see MD5Checker.get_state().
        """  # noqa
        state = self.info(dep)
        if current_state and (
                current_state['LastModified'] == state['LastModified']):
            # state unchanged, return None
            return
        else:
            return state


DOIT_CONFIG = {
    'check_file_uptodate': S3FileChangedChecker
}


def task_s3_download():
    """Download file from s3"""
    return {
        'file_dep': ['s3://ssr-scratch/remote_hello.txt'],
        'targets': ['local_hello.txt'],
        'actions': [Interactive(
            'aws s3 cp  s3://ssr-scratch/remote_hello.txt local_hello.txt'
        )],
    }
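One detail worth noting in the checker above: doit's default backends store saved state as JSON-encoded values, which is why info() stringifies everything; the raw s3fs.info() output contains datetime objects that json cannot serialize. A minimal illustration (the info dict below is made up to mimic the s3fs output shown later):

```python
import datetime
import json

raw_info = {
    'ETag': '"ed076287532e86365e841e92bfc50d8c"',
    'LastModified': datetime.datetime(2021, 11, 27, 17, 54, 45),
    'Size': 12,
}

# serializing the raw dict fails because of the datetime value
try:
    json.dumps(raw_info)
    serializable = True
except TypeError:
    serializable = False

# stringifying every value makes the state safe to persist
state = {key: str(value) for key, value in raw_info.items()}
roundtrip = json.loads(json.dumps(state))
```

The trade-off is that later comparisons (ETag, LastModified) are string comparisons, which is fine for equality checks like the ones in check_modified and get_state.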

Alternatives

Use a custom uptodate checker and define dependencies between tasks using task_dep. An example implementation of the uptodate checker is given below. This doesn’t require any change to pydoit to handle S3 file_deps or targets (or even a mix of local and S3 files). It sort of breaks the “there should be only one way to do it” principle, though, so using “proper” file_deps and targets as shown above is probably preferable?

import os
from doit.dependency import get_file_md5
from doit.tools import Interactive
import s3fs  # TODO: should just use boto3 here to keep things simple

class FileDepsAndTargetsUptodate:
    def __init__(self, file_deps=None, targets=None):
        """Check whether all input file_deps and targets are up to date

        file_deps (List[str]): input paths. Local filesystem or S3.
        targets (List[str]): output paths. Local filesystem or S3.
        """
        self.file_deps = file_deps or []
        self.targets = targets or []

    def __call__(self, task, values):

        def save_hashes():
            hashes = {
                path: self._info(path)['hash'] for path in self.file_deps
            }
            return {'hashes': hashes}

        task.value_savers.append(save_hashes)

        # If any target path is missing, the task isn't up to date
        if not all(self._exists(path) for path in self.targets):
            return False

        # If any input file has changed, the task isn't up to date
        previous_hash = values.get('hashes', {})
        if not previous_hash:
            return False
        else:
            is_uptodate = {
                path: previous_hash.get(path) == self._info(path)['hash']
                for path in self.file_deps
            }
            return all(is_uptodate.values())

    def _exists(self, file_path):
        if file_path.startswith('s3://'):
            fs = s3fs.S3FileSystem()
            return fs.exists(file_path)
        else:
            return os.path.exists(file_path)

    def _info(self, file_path):
        if file_path.startswith('s3://'):
            fs = s3fs.S3FileSystem()
            raw_info = fs.info(file_path)
            return {
                'hash': raw_info['ETag'],
                'type': 'ETag',
                'orig_info': raw_info
            }
        else:
            stats = os.stat(file_path)
            return {
                'hash': get_file_md5(file_path),
                'type': 'MD5',
                'orig_info': stats
            }

# example dodo task using uptodate
def task_s3_download():
    """Download file from s3"""
    return {
        'actions': [Interactive(
            'aws s3 cp s3://ssr-scratch/remote_hello.txt local_hello.txt'
        )],
        'uptodate': [
            FileDepsAndTargetsUptodate(
                file_deps=['s3://ssr-scratch/remote_hello.txt'],
                targets=['local_hello.txt'],
            )
        ],
    }
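With this approach, ordering between tasks is expressed with task_dep rather than file_dep/targets. A hypothetical downstream task (the task name references task_s3_download above; the action and file names are illustrative):

```python
def task_process_download():
    """Process the file fetched by the s3_download task."""
    return {
        # run only after s3_download has produced local_hello.txt;
        # no file_dep on the S3 URI is needed
        'task_dep': ['s3_download'],
        'actions': ['python process.py local_hello.txt'],
    }
```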

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

1 reaction
schettino72 commented, Dec 9, 2021

Check the “reporter” plugin, should be similar… Not really required, since you can set the Checker in dodo.py.

0 reactions
samuelsinayoko commented, Dec 12, 2021

ok. merged that.

Do you plan to put this S3FileChangedChecker on a package or GIST? I guess it would be nice to integrate this in one way or another into the docs.

Gist link to the S3FileChangedChecker implementation: https://gist.github.com/samuelsinayoko/8c4f5da30132a099c3decf49849c59d8
