Make FileChangedChecker and Dependency classes more flexible to allow for cloud based file dependencies
Description
Enable doit to track file dependencies for remote file systems such as AWS S3.

doit tracks file dependencies via `file_dep` in the task dictionary. File dependencies are checked via the `dependency.FileChangedChecker` class, which indicates whether a file is up to date or not. The `FileChangedChecker` class is of the form:
```python
class FileChangedChecker:

    def check_modified(self, file_path, file_stat, state):
        ...
        return True  # if file is modified

    def get_state(self, dep, current_state):
        state = ...
        return state
```
doit provides two builtin implementations, `MD5Checker` and `TimestampChecker`, which respectively use the md5 hash and the timestamp to check whether a file has changed. The class is customizable: the user can provide a custom `FileChangedChecker` subclass as documented at https://pydoit.org/cmd_run.html#custom-check-file-uptodate:

```python
DOIT_CONFIG = {'check_file_uptodate': CustomFileChangedChecker}
```
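As a minimal illustration of this interface, here is a sketch of a custom checker that compares file size and mtime instead of an md5 hash. The class name `SizeMtimeChecker` is hypothetical; in a real dodo.py it would subclass `doit.dependency.FileChangedChecker` and be registered via `DOIT_CONFIG` as above.

```python
import os


class SizeMtimeChecker:
    """Sketch of a custom up-to-date checker implementing doit's
    checker interface.  A file is considered unchanged when its
    (size, mtime) pair matches the previously saved state."""

    def check_modified(self, file_path, file_stat, state):
        """Return True if the file differs from the saved state."""
        return state != (file_stat.st_size, file_stat.st_mtime)

    def get_state(self, dep, current_state):
        """Return the new state, or None if it is unchanged."""
        stat = os.stat(dep)
        state = (stat.st_size, stat.st_mtime)
        return None if state == current_state else state
```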
This should provide a route to handle remote filesystems such as AWS S3. However, in doit's latest version (v0.33), the `file_stat` argument that is passed to `FileChangedChecker.check_modified` is computed by `os.stat` in `dependency.Dependency`. In addition, `dependency.Dependency` uses `os.path.exists` to check whether a target exists. The `dependency.Dependency` class isn't itself customizable, so AFAICT both calls need to be updated to enable handling remote storage/filesystems such as AWS S3 via a custom `FileChangedChecker`.
Proposed implementation
Add two methods to `FileChangedChecker`, `exists` and `info`, that can be called by `dependency.Dependency` instead of the currently hard-coded `os.path.exists` and `os.stat`:
```python
class FileChangedChecker:

    def exists(self, file_path):
        """Return True if file_path exists, False otherwise."""
        return os.path.exists(file_path)  # default implementation

    def info(self, file_path):
        """Return some metadata about the file at file_path."""
        return os.stat(file_path)  # default implementation

    def check_modified(self, file_path, file_stat, state):
        ...
        return True  # if file is modified

    def get_state(self, dep, current_state):
        state = ...
        return state
```
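To make the proposal concrete, here is a rough sketch of how `dependency.Dependency` could delegate to the configured checker instead of calling `os.path.exists`/`os.stat` directly. This is an illustrative sketch of the idea, not the actual doit source; the `DefaultChecker` class and the method names `target_exists`/`dep_modified` are hypothetical.

```python
import os


class DefaultChecker:
    """Hypothetical default checker with the two proposed methods,
    here keyed on the file's mtime for brevity."""

    def exists(self, file_path):
        return os.path.exists(file_path)

    def info(self, file_path):
        return os.stat(file_path)

    def check_modified(self, file_path, file_stat, state):
        return state != file_stat.st_mtime

    def get_state(self, dep, current_state):
        mtime = os.stat(dep).st_mtime
        return None if mtime == current_state else mtime


class Dependency:
    """Sketch of a Dependency that delegates filesystem access to the
    checker instead of hard-coding os.path.exists / os.stat."""

    def __init__(self, checker=None):
        self.checker = checker or DefaultChecker()  # e.g. an S3 checker

    def target_exists(self, target):
        # was: os.path.exists(target)
        return self.checker.exists(target)

    def dep_modified(self, dep, state):
        # was: file_stat = os.stat(dep)
        file_stat = self.checker.info(dep)
        return self.checker.check_modified(dep, file_stat, state)
```

With this shape, an S3-aware checker only has to override `exists` and `info`; `Dependency` itself never touches the local filesystem directly.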
See PR https://github.com/pydoit/doit/pull/407
Example dodo.py
"""
Example dodo.file with S3 file_dep.
"""
from doit.tools import Interactive
from doit.dependency import FileChangedChecker
import s3fs # TODO: should just use boto3 here to keep things simple
import os
class S3FileChangedChecker(FileChangedChecker):
"""Check if S3 File is up to date
Assumes that FileChangedChecker has two additional methods
`exists` and `info` that are called within
`doit.dependency.Dependency` instead of `os.path.exists` and
`os.stat`.
See https://github.com/pydoit/doit/pull/407
"""
CheckerError = FileNotFoundError
def exists(self, file_path):
# target might be local so handle both cases
if file_path.startswith('s3://'):
fs = s3fs.S3FileSystem()
return fs.exists(file_path)
else:
return os.path.exists(file_path)
def info(self, file_path):
# for now this assumes the file is in s3
fs = s3fs.S3FileSystem()
raw_info = fs.info(file_path)
# make sure the result is JSON serializable
return {key: str(value) for (key, value) in raw_info.items()}
def check_modified(self, file_path, file_stat, state):
"""Check if file in file_path is modified from previous "state".
file_path (string): file path
file_stat: result of os.stat() of file_path
state: state that was previously saved with ``get_state()``
returns (bool): True if dep is modified
"""
if file_stat['ETag'] != state['ETag']:
return True
return False
def get_state(self, dep, current_state):
"""Compute the state of a task after it has been successfully executed.
dep (str): path of the dependency file.
current_state (tuple): the current state, saved from a previous
execution of the task (None if the task was never run).
returns (dict|None): the new state. Return None if the state is unchanged.
state is of the form
{
'ETag': '"ed076287532e86365e841e92bfc50d8c"',
'Key': 'ssr-scratch/hello.txt',
'LastModified': datetime.datetime(2021, 11, 27, 17, 54, 45, tzinfo=tzutc()),
'Size': 12,
'size': 12,
'name': 'ssr-scratch/hello.txt',
'type': 'file',
'StorageClass': 'STANDARD',
'VersionId': None
}
The parameter `current_state` is passed to allow speed optimization,
see MD5Checker.get_state().
""" # noqa
state = self.info(dep)
if current_state and (
current_state['LastModified'] == state['LastModified']):
# state not change, returning None
return
else:
return state
DOIT_CONFIG = {
'check_file_uptodate': S3FileChangedChecker
}
def task_s3_download():
"""Download file from s3"""
return {
'file_dep': ['s3://ssr-scratch/remote_hello.txt'],
'targets': ['local_hello.txt'],
'actions': [Interactive(
'aws s3 cp s3://ssr-scratch/remote_hello.txt local_hello.txt'
)],
}
Alternatives
Use a custom `uptodate` checker and define dependencies between tasks using `task_dep`. An example implementation of the uptodate checker is given below. This doesn't require any change to pydoit to handle S3 file_deps or targets (or even a mix of local and S3 files). It sort of breaks the "there should be only one way to do it" principle, so using "proper" file_deps and targets as shown above is probably preferable?
```python
import os

import s3fs  # TODO: should just use boto3 here to keep things simple

from doit.dependency import get_file_md5
from doit.tools import Interactive


class FileDepsAndTargetsUptodate:

    def __init__(self, file_deps=None, targets=None):
        """Check whether all input file_deps and targets are up to date.

        file_deps (List[str]): input paths. Local filesystem or S3.
        targets (List[str]): output paths. Local filesystem or S3.
        """
        self.file_deps = file_deps or []
        self.targets = targets or []

    def __call__(self, task, values):
        def save_hashes():
            hashes = {
                path: self._info(path)['hash'] for path in self.file_deps
            }
            return {'hashes': hashes}

        task.value_savers.append(save_hashes)
        # If any target path is missing, the task isn't up to date
        if not all(self._exists(path) for path in self.targets):
            return False
        # If any input file has changed, the task isn't up to date
        previous_hashes = values.get('hashes', {})
        if not previous_hashes:
            return False
        return all(
            previous_hashes.get(path) == self._info(path).get('hash')
            for path in self.file_deps
        )

    def _exists(self, file_path):
        if file_path.startswith('s3://'):
            fs = s3fs.S3FileSystem()
            return fs.exists(file_path)
        else:
            return os.path.exists(file_path)

    def _info(self, file_path):
        if file_path.startswith('s3://'):
            fs = s3fs.S3FileSystem()
            raw_info = fs.info(file_path)
            return {
                'hash': raw_info['ETag'],
                'type': 'ETag',
                'orig_info': raw_info
            }
        else:
            stats = os.stat(file_path)
            return {
                'hash': get_file_md5(file_path),
                'type': 'MD5',
                'orig_info': stats
            }


# example dodo task using uptodate
def task_s3_download():
    """Download file from S3."""
    return {
        'actions': [Interactive(
            'aws s3 cp s3://ssr-scratch/remote_hello.txt local_hello.txt'
        )],
        'uptodate': [
            FileDepsAndTargetsUptodate(
                file_deps=['s3://ssr-scratch/remote_hello.txt'],
                targets=['local_hello.txt'],
            )
        ],
    }
```
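For reference, the `uptodate` callable protocol that the class above relies on is simple: doit calls each entry with the task object and the dict of values saved by the previous run, and expects a bool back. Below is a minimal local-only illustration of that protocol; the `FakeTask` stand-in and the `mtime_uptodate` helper are hypothetical names, for demonstration only.

```python
import os


class FakeTask:
    """Hypothetical stand-in for doit's Task, exposing only the
    value_savers list that uptodate callables use to persist values."""

    def __init__(self):
        self.value_savers = []


def mtime_uptodate(path):
    """Return an uptodate callable keyed off the file's mtime."""

    def check(task, values):
        mtime = os.path.getmtime(path) if os.path.exists(path) else None
        # register a saver so the current mtime is stored after the run
        task.value_savers.append(lambda: {'mtime': mtime})
        # up to date only if the file exists and its mtime is unchanged
        return mtime is not None and values.get('mtime') == mtime

    return check
```

On the first run `values` is empty, so the callable returns False and the task executes; on later runs doit passes the saved values back in, and the task is skipped while the mtime matches.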
Issue Analytics

- Created: 2 years ago
- Comments: 6 (6 by maintainers)
Top GitHub Comments
Check the “reporter” plugin, it should be similar… Not really required, since you can set the Checker in dodo.py.
Gist link to the S3FileChangedChecker implementation: https://gist.github.com/samuelsinayoko/8c4f5da30132a099c3decf49849c59d8