Optimisation for the contains method of storage.LRUStoreCache


I have a dataset of a few TB with 11 parameters and about 100000 chunks, stored in Azure Blob Storage through the ABSStore mutable mapping. When I call zarr.open_group(store=store, mode='r') with store set to zarr.LRUStoreCache(max_size=2**33, store=zarr.storage.ABSStore('testcontainer', 'mydataset', BLOB_ACCOUNT_NAME, BLOB_ACCOUNT_KEY)), opening the group takes about 45 seconds. Without the LRU wrapper, open_group is instantaneous.

I traced the problem to the __contains__ method of the LRUStoreCache mutable mapping wrapper (open_group calls the contains_array method). __contains__ in LRUStoreCache is implemented by listing all the keys of the underlying store, so all 100000 chunks are listed before the existence check is made. With cloud storage this can cause significant overhead.

This is the method of LRUStoreCache now:

    def __contains__(self, key):
        with self._mutex:
            if self._contains_cache is None:
                self._contains_cache = set(self._keys())
            return key in self._contains_cache

When I changed it to this:

    def __contains__(self, key):
        return key in self._store

the open_group operation is almost instantaneous, because the __contains__ method of the underlying ABSStore class uses the exists option on Azure Blob and so doesn’t have to list all keys.
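To make the difference concrete without needing an Azure account, here is a minimal sketch that counts how often each strategy lists the store. CountingStore, ListingContains and DelegatingContains are hypothetical names made up for this illustration, not zarr classes; the in-memory dict only stands in for a remote store such as ABSStore.

```python
from collections.abc import MutableMapping


class CountingStore(MutableMapping):
    """Hypothetical in-memory stand-in for a remote store such as ABSStore."""

    def __init__(self, data):
        self._data = dict(data)
        self.list_calls = 0    # number of full key listings
        self.exists_calls = 0  # number of single-key existence checks

    def __getitem__(self, key):
        return self._data[key]

    def __setitem__(self, key, value):
        self._data[key] = value

    def __delitem__(self, key):
        del self._data[key]

    def __iter__(self):
        # Iterating the mapping is the expensive "list every blob" operation.
        self.list_calls += 1
        return iter(list(self._data))

    def __len__(self):
        return len(self._data)

    def __contains__(self, key):
        # A cheap per-key check, comparable to an "exists" request.
        self.exists_calls += 1
        return key in self._data


class ListingContains:
    """Mimics the behaviour described above: list all keys on first lookup."""

    def __init__(self, store):
        self._store = store
        self._contains_cache = None

    def __contains__(self, key):
        if self._contains_cache is None:
            self._contains_cache = set(self._store)
        return key in self._contains_cache


class DelegatingContains:
    """Mimics the proposed change: ask the underlying store directly."""

    def __init__(self, store):
        self._store = store

    def __contains__(self, key):
        return key in self._store


store = CountingStore({f"var/{i}": b"" for i in range(100_000)})
store[".zgroup"] = b"{}"

listing = ListingContains(store)
delegating = DelegatingContains(store)

found_by_listing = ".zgroup" in listing        # triggers one full listing
found_by_delegating = ".zgroup" in delegating  # triggers one existence check
print(store.list_calls, store.exists_calls)
```

The listing variant pays for one full listing up front and then answers later lookups locally, while the delegating variant pays one round trip per lookup. Which is cheaper depends on how many lookups follow, so this is an illustration of the trade-off rather than a recommendation.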


Top GitHub Comments


alimanfoo commented, Sep 3, 2018

Forgot to mention, there is another trick you can pull re caching of data versus metadata. If you have all data and metadata in the same store, you can still have separate caches for data and metadata. E.g.:

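    import zarr

    store = ...  # some store with all data and metadata
    # max_size is required (bytes, or None for unlimited); 2**28 is only an example value
    data_cache = zarr.LRUStoreCache(store, max_size=2**28)
    meta_cache = zarr.LRUStoreCache(store, max_size=2**28)
    root = zarr.Group(store=meta_cache, chunk_store=data_cache)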

Not sure how practically useful, but does at least provide a way to ensure data reads don’t push metadata out of the cache.
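A toy illustration of that eviction effect, assuming a simplified cache: TinyLRU is a hypothetical class written for this sketch that bounds the cache by item count, whereas zarr's LRUStoreCache is bounded by bytes. The key names and sizes are made up.

```python
from collections import OrderedDict


class TinyLRU:
    """Hypothetical count-bounded LRU read cache, for illustration only."""

    def __init__(self, store, max_items):
        self._store = store
        self._max_items = max_items
        self._cache = OrderedDict()
        self.missed_keys = []  # every key that had to be fetched from the store

    def __getitem__(self, key):
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as most recently used
            return self._cache[key]
        self.missed_keys.append(key)
        value = self._store[key]
        self._cache[key] = value
        if len(self._cache) > self._max_items:
            self._cache.popitem(last=False)  # evict the least recently used entry
        return value


store = {".zgroup": b"{}"}
store.update({f"chunk/{i}": bytes(8) for i in range(10)})

# One shared cache: reading the chunks pushes the metadata key out.
shared = TinyLRU(store, max_items=4)
shared[".zgroup"]
for i in range(10):
    shared[f"chunk/{i}"]
shared[".zgroup"]
shared_meta_misses = shared.missed_keys.count(".zgroup")

# Separate caches: chunk reads never touch the metadata cache.
meta_cache = TinyLRU(store, max_items=4)
data_cache = TinyLRU(store, max_items=4)
meta_cache[".zgroup"]
for i in range(10):
    data_cache[f"chunk/{i}"]
meta_cache[".zgroup"]
separate_meta_misses = meta_cache.missed_keys.count(".zgroup")

print(shared_meta_misses, separate_meta_misses)
```

With the shared cache the metadata key is fetched from the store twice; with separate caches it is fetched once and stays cached no matter how many chunks are read.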


alimanfoo commented, Sep 3, 2018

Yes let’s get #268 done, I’m still catching up after summer holidays but working my way towards it.
