Support for caching of data files
A commonly requested use case for an Intake catalog is a hybrid between two currently supported options:
- Fully local data: the data files are packaged with the catalog file on the filesystem
- Fully remote data: the data files are streamed from a remote source (HTTP, S3, GCP, HDFS) on every access
The hybrid case would be to copy the data files from the remote source once (either on first access, or with an explicit “prefetch” command), and then access them locally from then on. This would both reduce the load on public data servers (by avoiding repeated downloads) and increase loading performance after that first copy.
There are some tricky issues to solve here, but here’s one possible approach:
The catalog syntax is extended so that catalog entries can have an optional `cache` section:
    sources:
      trips:
        description: taxi trips
        driver: csv
        cache:
          - src: s3://example.com/data/2017-01/
            dest: taxi/2017-01/
            pattern: "*.csv"
        args:
          urlpath: "{{ INTAKE_CACHE }}/taxi/2017-01/*.csv"
The `cache` section contains a list of dictionaries with `src`, `dest` and `pattern` keys. Files will be copied from `[src]/[pattern]` to `[INTAKE_CACHE]/[dest]/[pattern]` when the catalog entry is first opened, where `[INTAKE_CACHE]` defaults to a hidden directory in the user's home directory unless overridden.
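To make the copy rule concrete, here is a minimal sketch of how one `cache` entry could be expanded into (remote, local) pairs. The function name `plan_cache_copies` and the `remote_files` argument are hypothetical and not part of Intake; a real implementation would list the remote store itself and then perform the copies.

```python
import fnmatch
from pathlib import Path


def plan_cache_copies(entry, remote_files, cache_root):
    """Return (remote_url, local_path) pairs for one cache entry.

    entry: dict with "src", "dest" and "pattern" keys, as in the proposal.
    remote_files: file names found directly under entry["src"]
        (hypothetical input; listing the remote store is out of scope here).
    cache_root: the resolved INTAKE_CACHE directory.
    """
    src = entry["src"].rstrip("/")
    dest_dir = Path(cache_root) / entry["dest"]
    pairs = []
    for name in sorted(remote_files):
        # Only files matching the glob pattern are cached.
        if fnmatch.fnmatchcase(name, entry["pattern"]):
            pairs.append((f"{src}/{name}", dest_dir / name))
    return pairs


entry = {
    "src": "s3://example.com/data/2017-01/",
    "dest": "taxi/2017-01/",
    "pattern": "*.csv",
}
pairs = plan_cache_copies(
    entry, ["a.csv", "b.csv", "README.txt"], "/home/user/.intake/cache"
)
for remote, local in pairs:
    print(remote, "->", local)
```

Only the two `.csv` files are planned for copying; `README.txt` does not match the pattern and is skipped.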
Some issues that need to be decided:
- Checksum verification? Do we want to have an extra option to set SHA256 hashes for downloaded files? Maybe not yet?
- Cache expiration? Should we assume that once downloaded, the cache for a catalog entry never needs to be refreshed? Use a “version number” in the `cache` section (similar in concept to a conda build number) to allow updated catalogs to force the cache to refresh for that entry?
- Control of cache location: We could add a section to the catalog metadata section (#75) to override where cached data is stored. Otherwise, we can use appdirs to pick a suitable location.
- How to handle dask? By definition, cached data will be on the client, so we would want to force dask to stream it from the client. (Not sure how to do that most easily.)
- Intake CLI extensions: At a minimum, we’ll need a command to prepopulate the cache for all entries in a given catalog file (or the global catalog space), and a command to clear the cache.
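As one way to picture the checksum and expiration questions together, the sketch below treats a cached entry as fresh only if a marker file records the expected version number and any listed SHA256 hashes match. Everything here is an assumption for illustration: the marker file name `.intake_cache.json`, the function names, and the idea of passing checksums as a name-to-hash dictionary are not taken from Intake or from the proposal.

```python
import hashlib
import json
from pathlib import Path

MARKER_NAME = ".intake_cache.json"  # hypothetical marker file name


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large data files are not read into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mark_cached(dest_dir, version):
    """Record which catalog 'version number' the cached files belong to."""
    marker = Path(dest_dir) / MARKER_NAME
    marker.write_text(json.dumps({"version": version}))


def cache_is_fresh(dest_dir, version, checksums=None):
    """Return True if dest_dir was cached for `version` and hashes match.

    checksums: optional dict mapping file name to expected SHA256 hex digest.
    """
    marker = Path(dest_dir) / MARKER_NAME
    if not marker.is_file():
        return False  # never downloaded
    if json.loads(marker.read_text()).get("version") != version:
        return False  # catalog bumped the version, so force a refresh
    for name, expected in (checksums or {}).items():
        path = Path(dest_dir) / name
        if not path.is_file() or sha256_of(path) != expected:
            return False  # missing or corrupted file
    return True
```

With this shape, a catalog author forces a re-download by incrementing the version, and checksum verification stays optional because `checksums` can be omitted.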
Other issues I might have forgotten?
(Tagging @martindurant and @alimanfoo)
Possible spec: https://gist.github.com/martindurant/bc38d581cd3f9d444a656cbceae9d8ba
Since Intake will be downloading files on your behalf, I can imagine some people might want to control where those cached files are written. A reasonable default would be something like .local/share/intake/cache/ on Linux, but maybe that isn’t a good idea in some specific situations.
Yes, would be useful to be able to control the cache location in some cases.
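A rough sketch of how the cache location could be resolved, assuming a precedence of explicit override (for example from catalog metadata), then an environment variable, then a per-user default. The environment variable name `INTAKE_CACHE` and the function `resolve_cache_dir` are assumptions for illustration. The issue suggests appdirs for picking the default; this sketch sticks to the standard library and only covers the Linux-style default mentioned above.

```python
import os
from pathlib import Path


def resolve_cache_dir(override=None, environ=None):
    """Pick the cache directory: override, then env var, then user default.

    override: explicit location, e.g. from catalog metadata (hypothetical).
    environ: mapping of environment variables; defaults to os.environ.
    """
    environ = os.environ if environ is None else environ
    if override:
        return Path(override).expanduser()
    if environ.get("INTAKE_CACHE"):  # assumed variable name
        return Path(environ["INTAKE_CACHE"]).expanduser()
    # Linux-style default: $XDG_DATA_HOME or ~/.local/share
    data_home = environ.get("XDG_DATA_HOME") or str(
        Path.home() / ".local" / "share"
    )
    return Path(data_home) / "intake" / "cache"


print(resolve_cache_dir())
```

Passing `environ` explicitly keeps the function easy to test without touching the real environment.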