
StoreCache, a storage class to add arbitrary caches to existing stores


I’ve implemented a simple StoreCache which is similar to LRUStoreCache but accepts a cache object to use instead of implementing the cache internally. This was created to meet a requirement to cache on disk instead of in memory. It can be composed with LRUStoreCache to enable layered caching.
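A minimal sketch of that layered composition, with plain dicts standing in for both the remote store and the on-disk cache (StoreCache abbreviated here to its core read/write path; in practice the outer layer could be zarr's LRUStoreCache and the inner cache a diskcache.Cache):

```python
from collections.abc import MutableMapping

class StoreCache(MutableMapping):
    """Abbreviated read/write path of the proposed class."""
    def __init__(self, store, cache):
        self._store, self._cache = store, cache
    def __getitem__(self, key):
        value = self._cache.get(key)
        if value is None:
            value = self._store[key]   # miss: fall through to the store
            self._cache[key] = value
        return value
    def __setitem__(self, key, value):
        self._store[key] = value
        self._cache[key] = value
    def __delitem__(self, key):
        del self._store[key]
        self._cache.pop(key, None)     # the key may never have been cached
    def __iter__(self):
        return iter(self._store)
    def __len__(self):
        return len(self._store)

remote = {'foo/bar': b'chunk-bytes'}    # stand-in for e.g. s3fs.S3Map
disk_layer = StoreCache(remote, {})     # dict in place of diskcache.Cache()
mem_layer = StoreCache(disk_layer, {})  # dict in place of an in-memory LRU

mem_layer['foo/bar']  # first read falls through and populates both layers
```

After the first read, the chunk is held in both cache layers, so subsequent reads never reach the remote store.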

StoreCache enables a variety of off-the-shelf caches to be used as a chunk cache. The example script below tests DiskCache, the built-in dbm module, and, with a slight modification, cachey. For each cache, two requests are made to a Zarr dataset on S3:

import dbm
from timeit import timeit
import tempfile

import cachey
import diskcache
import s3fs
import zarr
from zarr.storage import StoreCache

s3 = s3fs.S3FileSystem(anon=True, client_kwargs=dict(region_name='eu-west-2'))
uncached_store = s3fs.S3Map(root='zarr-demo/store', s3=s3, check=False)

disk_cache_store = StoreCache(uncached_store, diskcache.Cache())
dict_cache_store = StoreCache(uncached_store, dict())
dbm_cache_store = StoreCache(uncached_store, dbm.open(tempfile.TemporaryDirectory().name, flag='c'))

# adapt a cache that doesn't support item assignment
class CacheyStoreCache(cachey.Cache):
    def __setitem__(self, k, v):
        self.put(k, v, cost=len(v))

cachey_store = StoreCache(uncached_store, CacheyStoreCache(2**20))

def benchmark(cache_type, store):
    root = zarr.group(store=store)
    f = lambda: root["foo/bar/baz"][:]
    t1 = timeit(f, number=1)
    t2 = timeit(f, number=1)
    print(f'{cache_type}\nt1: {t1}\nt2: {t2}')


benchmark('disk', disk_cache_store)
benchmark('dict', dict_cache_store)
benchmark('dbm', dbm_cache_store)
benchmark('cachey', cachey_store)
benchmark('uncached', uncached_store)

Results from my laptop:

disk
t1: 1.8804557690000365
t2: 0.0005672639999829698
dict
t1: 1.9392649410001468
t2: 0.0005628700000670506
dbm
t1: 2.0158668909998596
t2: 0.0004138780000175757
cachey
t1: 1.766250748999937
t2: 0.0004628369997590198
uncached
t1: 2.013281605999964
t2: 1.9759216810002727

Enough of the MutableMapping interface is implemented for it to pass the test suite:

from collections.abc import MutableMapping

class StoreCache(MutableMapping):

    def __init__(self, store, cache):
        self._cache = cache
        self._store = store

    def __getitem__(self, key):
        value = self._cache.get(key)
        # test for None rather than falsiness, so empty chunks
        # (e.g. b'') cached earlier aren't treated as misses
        if value is None:
            value = self._store[key]
            self._cache[key] = value
        return value

    def __iter__(self):
        return iter(self._store)

    def __len__(self):
        return len(self._store)

    def __contains__(self, key):
        return key in self._cache or key in self._store

    def keys(self):
        return self._store.keys()

    def __delitem__(self, key):
        del self._store[key]
        try:
            del self._cache[key]
        except KeyError:
            pass  # the key may never have been cached

    def __setitem__(self, key, value):
        self._store[key] = value
        self._cache[key] = value

    def items(self):
        return self._store.items()

    def values(self):
        return self._store.values()

If this would be useful, I can submit a pull request. Thanks!

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

2 reactions
jsheedy commented, Sep 22, 2021

That’s a great post; leveraging CDNs for lower-level things like Zarr chunks could be interesting. Our needs are totally aligned on “I should never have to pull the data across the cloud boundary more than once in a single session.” Are you looking for a general solution beyond Zarr chunks?

I haven’t written anything up about this project yet, but Pangeo Showcase sounds like a good forum in which to share. I’ll check in on my end if it would be OK.

1 reaction
jakirkham commented, Nov 12, 2021

The other thing that might be worth playing with is the LRUStoreCache, which was created for this purpose (data crossing cloud boundaries only once). There’s an example in the docs.
