
StoreCache, a storage class to add arbitrary caches to existing stores


I’ve implemented a simple StoreCache which is similar to LRUStoreCache but accepts a cache object to use instead of implementing the cache internally. This was created to meet a requirement to cache on disk instead of in memory. It can be composed with LRUStoreCache to enable layered caching.
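A minimal sketch of that layered composition, with plain dicts standing in for both the remote store and the on-disk cache (StoreCache abbreviated here to its core read/write path; in practice the outer layer could be zarr's LRUStoreCache and the inner cache a diskcache.Cache):

```python
from collections.abc import MutableMapping

class StoreCache(MutableMapping):
    """Abbreviated read/write path of the proposed class."""
    def __init__(self, store, cache):
        self._store, self._cache = store, cache
    def __getitem__(self, key):
        value = self._cache.get(key)
        if value is None:
            value = self._store[key]   # miss: fall through to the store
            self._cache[key] = value
        return value
    def __setitem__(self, key, value):
        self._store[key] = value
        self._cache[key] = value
    def __delitem__(self, key):
        del self._store[key]
        self._cache.pop(key, None)     # the key may never have been cached
    def __iter__(self):
        return iter(self._store)
    def __len__(self):
        return len(self._store)

remote = {'foo/bar': b'chunk-bytes'}    # stand-in for e.g. s3fs.S3Map
disk_layer = StoreCache(remote, {})     # dict in place of diskcache.Cache()
mem_layer = StoreCache(disk_layer, {})  # dict in place of an in-memory LRU

mem_layer['foo/bar']  # first read falls through and populates both layers
```

After the first read, the chunk is held in both cache layers, so subsequent reads never reach the remote store.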

StoreCache enables a variety of off-the-shelf caches to be used as a chunk cache. The example script below tests DiskCache, the built-in dbm module, and, with a slight modification, cachey. For each cache, two requests are made to a Zarr dataset on S3:

import dbm
from timeit import timeit
import tempfile

import cachey
import diskcache
import s3fs
import zarr
from zarr.storage import StoreCache

s3 = s3fs.S3FileSystem(anon=True, client_kwargs=dict(region_name='eu-west-2'))
uncached_store = s3fs.S3Map(root='zarr-demo/store', s3=s3, check=False)

disk_cache_store = StoreCache(uncached_store, diskcache.Cache())
dict_cache_store = StoreCache(uncached_store, dict())
dbm_cache_store = StoreCache(uncached_store, dbm.open(tempfile.TemporaryDirectory().name, flag='c'))

# adapt a cache that doesn't support item assignment
class CacheyStoreCache(cachey.Cache):
    def __setitem__(self, k, v):
        self.put(k, v, cost=len(v))

cachey_store = StoreCache(uncached_store, CacheyStoreCache(2**20))

def benchmark(cache_type, store):
    root = zarr.group(store=store)
    f = lambda: root["foo/bar/baz"][:]
    t1 = timeit(f, number=1)
    t2 = timeit(f, number=1)
    print(f'{cache_type}\nt1: {t1}\nt2: {t2}')


benchmark('disk', disk_cache_store)
benchmark('dict', dict_cache_store)
benchmark('dbm', dbm_cache_store)
benchmark('cachey', cachey_store)
benchmark('uncached', uncached_store)

Results from my laptop:

disk
t1: 1.8804557690000365
t2: 0.0005672639999829698
dict
t1: 1.9392649410001468
t2: 0.0005628700000670506
dbm
t1: 2.0158668909998596
t2: 0.0004138780000175757
cachey
t1: 1.766250748999937
t2: 0.0004628369997590198
uncached
t1: 2.013281605999964
t2: 1.9759216810002727

Enough of the MutableMapping interface is implemented for it to pass the test suite:

from collections.abc import MutableMapping

class StoreCache(MutableMapping):

    def __init__(self, store, cache):
        self._cache = cache
        self._store = store

    def __getitem__(self, key):
        value = self._cache.get(key)
        # test for None rather than falsiness, so empty chunks
        # (e.g. b'') cached earlier aren't treated as misses
        if value is None:
            value = self._store[key]
            self._cache[key] = value
        return value

    def __iter__(self):
        return iter(self._store)

    def __len__(self):
        return len(self._store)

    def __contains__(self, key):
        return key in self._cache or key in self._store

    def keys(self):
        return self._store.keys()

    def __delitem__(self, key):
        del self._store[key]
        try:
            del self._cache[key]
        except KeyError:
            pass  # the key may never have been cached

    def __setitem__(self, key, value):
        self._store[key] = value
        self._cache[key] = value

    def items(self):
        return self._store.items()

    def values(self):
        return self._store.values()

If this would be useful, I can submit a pull request. Thanks!

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

2 reactions
jsheedy commented, Sep 22, 2021

That’s a great post; leveraging CDNs for lower-level things like Zarr chunks could be interesting. Our needs are totally aligned on “I should never have to pull the data across the cloud boundary more than once in a single session.” Are you looking for a general solution beyond Zarr chunks?

I haven’t written anything up about this project yet, but Pangeo Showcase sounds like a good forum in which to share. I’ll check in on my end if it would be OK.

1 reaction
jakirkham commented, Nov 12, 2021

The other thing that might be worth playing with is the LRUStoreCache, which was created for this purpose (data crossing cloud boundaries only once). There’s an example in the docs.
