Support for caching of data files

A commonly requested use case for an Intake catalog is a hybrid between two currently supported options:

  • Fully local data: the data files are packaged with the catalog file on the filesystem
  • Fully remote data: the data files are streamed from a remote source (HTTP, S3, GCP, HDFS) on every access

The hybrid case would be to copy the data files from the remote source once (either on first access, or with an explicit “prefetch” command), and then access them locally from then on. This would both reduce the load on public data servers (by avoiding repeated downloads) and increase loading performance after that first copy.

There are some tricky issues to solve here, but here’s one possible approach:

The catalog syntax is extended so that catalog entries can have an optional cache section:

sources:
  trips:
    description: taxi trips
    driver: csv
    cache:
        - src: s3://example.com/data/2017-01/
          dest: taxi/2017-01/
          pattern: "*.csv"
    args:
        urlpath: "{{ INTAKE_CACHE }}/taxi/2017-01/*.csv"

The cache section contains a list of dictionaries with src, dest and pattern keys. When the catalog entry is first opened, files will be copied from [src]/[pattern] to [INTAKE_CACHE]/[dest]/[pattern], where [INTAKE_CACHE] defaults to a hidden directory in the user's home directory unless overridden.
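To make the src/dest/pattern mapping concrete, here is a minimal sketch of how the copy plan could be computed. This is not existing Intake code: the function name `plan_copies`, the `DEFAULT_CACHE_ROOT` value and the idea of passing in an already-listed set of remote files are all assumptions made for illustration. It only decides which remote file would land at which local path; it does not download anything.

```python
import fnmatch
from pathlib import Path

# Hypothetical default: the proposal only says "hidden in the user home directory".
DEFAULT_CACHE_ROOT = Path.home() / ".intake" / "cache"


def plan_copies(remote_files, src, dest, pattern, cache_root=DEFAULT_CACHE_ROOT):
    """Map each remote file under `src` that matches `pattern` to a local cache path.

    `remote_files` is an iterable of full remote URLs, assumed to have been
    listed already by whatever remote filesystem layer is in use.
    """
    prefix = src.rstrip("/") + "/"
    plan = {}
    for url in remote_files:
        if not url.startswith(prefix):
            continue  # not under this cache entry's source directory
        relative = url[len(prefix):]
        # Case-sensitive glob match; note that "*" also matches "/" here.
        if fnmatch.fnmatchcase(relative, pattern):
            plan[url] = Path(cache_root) / dest / relative
    return plan
```

For the example entry above, a remote file s3://example.com/data/2017-01/a.csv would map to [INTAKE_CACHE]/taxi/2017-01/a.csv, while files that do not match *.csv or that live outside the src prefix would be skipped.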

Some issues that need to be decided:

  • Checksum verification: Do we want to have an extra option to set SHA256 hashes for downloaded files? Maybe not yet?
  • Cache expiration: Should we assume that once downloaded, the cache for a catalog entry never needs to be refreshed? Or should we use a “version number” in the cache section (similar in concept to a conda build number) to allow updated catalogs to force the cache to refresh for that entry?
  • Control of cache location: We could add an entry to the catalog metadata section (#75) to override where cached data is stored. Otherwise, we can use appdirs to pick a suitable location.
  • How to handle dask? By definition, cached data will be on the client, so we would want to force dask to stream it from the client. (Not sure how to do that most easily.)
  • Intake CLI extensions: At a minimum, we’ll need a command to prepopulate the cache for all entries in a given catalog file (or the global catalog space), and a command to clear the cache.
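As a rough illustration of the checksum and expiration questions above, the sketch below shows one way they could work together. Everything here is hypothetical: the marker file name `.intake_cache_version`, the helper names, and the decision to store the version as plain text are assumptions for discussion, not part of Intake.

```python
import hashlib
from pathlib import Path

VERSION_MARKER = ".intake_cache_version"  # hypothetical marker file name


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_valid(path, expected_sha256=None):
    """A cached file is valid if it exists and matches the hash, when one is given."""
    path = Path(path)
    if not path.is_file():
        return False
    return expected_sha256 is None or sha256_of(path) == expected_sha256


def needs_refresh(cache_dir, catalog_version):
    """Refresh when no marker exists or the catalog's version number has changed."""
    marker = Path(cache_dir) / VERSION_MARKER
    if not marker.is_file():
        return True
    return marker.read_text().strip() != str(catalog_version)


def mark_refreshed(cache_dir, catalog_version):
    """Record the version that the cached files were downloaded for."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    (cache_dir / VERSION_MARKER).write_text(str(catalog_version))
```

With something like this, a prefetch command could call needs_refresh for each entry, download only when it returns True, and then call mark_refreshed. Making the hash optional keeps checksum verification opt-in, which fits the "maybe not yet" above.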

Other issues I might have forgotten?

(Tagging @martindurant and @alimanfoo)

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Reactions: 5
  • Comments: 7 (5 by maintainers)

Top GitHub Comments

alimanfoo commented, May 2, 2018

Since Intake will be downloading files on your behalf, I can imagine some people might want to control where those cached files are written. A reasonable default would be something like .local/share/intake/cache/ on Linux, but maybe that isn’t a good idea in some specific situations.

Yes, would be useful to be able to control the cache location in some cases.
