
Optimisation for the __contains__ method of storage.LRUStoreCache


I have a dataset a few TB in size, with 11 parameters and about 100,000 chunks, stored in Azure Blob Storage via the ABSStore mutable mapping. When I call zarr.open_group(store=store, mode='r') with store as zarr.LRUStoreCache(max_size=2**33, store=zarr.storage.ABSStore('testcontainer', 'mydataset', BLOB_ACCOUNT_NAME, BLOB_ACCOUNT_KEY)), opening the group takes about 45 seconds. Without the LRU wrapper, open_group is nearly instantaneous. I traced the problem to the __contains__ method of the LRUStoreCache mutable mapping wrapper (open_group calls the contains_array method). The __contains__ method in LRUStoreCache is implemented by listing all keys in the underlying store, so all 100,000 chunks are listed before checking for existence. In the context of cloud storage this causes significant overhead.
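For reference, the setup described above looks roughly like the following sketch (container, dataset, and credential names are the placeholders from the description):

    import zarr
    import zarr.storage

    # Underlying store: an Azure Blob Storage container holding the dataset.
    abs_store = zarr.storage.ABSStore(
        'testcontainer', 'mydataset', BLOB_ACCOUNT_NAME, BLOB_ACCOUNT_KEY)

    # Wrap it in an in-memory LRU cache capped at 8 GiB.
    store = zarr.LRUStoreCache(abs_store, max_size=2**33)

    # open_group calls contains_array(), which invokes
    # LRUStoreCache.__contains__ and ends up listing every key in the store.
    root = zarr.open_group(store=store, mode='r')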

This is the current implementation of __contains__ in LRUStoreCache:

    def __contains__(self, key):
        with self._mutex:
            if self._contains_cache is None:
                # Populates the cache by listing *all* keys in the
                # underlying store, which is expensive on cloud storage.
                self._contains_cache = set(self._keys())
            return key in self._contains_cache

When I changed it to the following:

    def __contains__(self, key):
        # Delegate membership checks to the underlying store.
        return key in self._store

With this change, the open_group operation became almost instantaneous: the __contains__ method of the underlying ABSStore class uses Azure Blob's existence check and therefore does not have to list all keys.
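As a workaround that does not require patching zarr itself, a small subclass could delegate membership tests to the wrapped store. This is only a hypothetical sketch, assuming the wrapped store provides an efficient __contains__ (as ABSStore does) and relying on the _store attribute referenced above:

    import zarr

    class ExistsLRUStoreCache(zarr.LRUStoreCache):
        """Hypothetical LRU cache that delegates __contains__ to the wrapped store."""

        def __contains__(self, key):
            # ABSStore answers this with a single blob existence request
            # instead of listing every key in the container.
            return key in self._store

    # Usage: wrap the ABSStore exactly as before, e.g.
    # store = ExistsLRUStoreCache(abs_store, max_size=2**33)

The trade-off is one existence request per lookup instead of a single, potentially very large, key listing.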

Issue Analytics

  • State: open
  • Created: 5 years ago
  • Reactions: 1
  • Comments: 18 (18 by maintainers)

Top GitHub Comments

1 reaction
alimanfoo commented, Sep 3, 2018

Forgot to mention, there is another trick you can pull re caching of data versus metadata. If you have all data and metadata in the same store, you can still have separate caches for data and metadata. E.g.:

    store = ...  # some store holding all data and metadata
    # max_size=None gives an unbounded cache; set a byte limit if preferred
    data_cache = zarr.LRUStoreCache(store, max_size=None)
    meta_cache = zarr.LRUStoreCache(store, max_size=None)
    # metadata reads go through meta_cache, chunk reads through data_cache
    root = zarr.Group(store=meta_cache, chunk_store=data_cache)

Not sure how practically useful that is, but it does at least provide a way to ensure data reads don’t push metadata out of the cache.
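Applied to the ABSStore setup from the issue description, the same pattern might look like this sketch (cache sizes are illustrative, and the container/credential names are the placeholders from above):

    import zarr
    import zarr.storage

    # Placeholders carried over from the issue description.
    abs_store = zarr.storage.ABSStore(
        'testcontainer', 'mydataset', BLOB_ACCOUNT_NAME, BLOB_ACCOUNT_KEY)

    # Modest cache for metadata keys, large bounded cache for chunk data.
    meta_cache = zarr.LRUStoreCache(abs_store, max_size=2**26)  # ~64 MiB
    data_cache = zarr.LRUStoreCache(abs_store, max_size=2**33)  # ~8 GiB

    # Chunk reads go through data_cache only, so bulk data access cannot
    # evict cached .zgroup/.zarray/.zattrs metadata from meta_cache.
    root = zarr.Group(store=meta_cache, chunk_store=data_cache)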

0 reactions
alimanfoo commented, Sep 3, 2018

Yes let’s get #268 done, I’m still catching up after summer holidays but working my way towards it.

