Optimisation for the __contains__ method of storage.LRUStoreCache
I have a dataset a few TB in size with 11 parameters and about 100000 chunks, and am storing it in Azure blob storage using the ABSStore mutable mapping. When I do zarr.open_group(store=store, mode='r') with store as zarr.LRUStoreCache(max_size=2**33, store=zarr.storage.ABSStore('testcontainer', 'mydataset', BLOB_ACCOUNT_NAME, BLOB_ACCOUNT_KEY)), it takes about 45 seconds to open the group. Without the LRU wrapper the open_group operation is instantaneous. I traced the problem to the __contains__ method in the LRUStoreCache mutable mapping wrapper (open_group calls the contains_array method). The __contains__ method in LRUStoreCache is implemented by listing all the keys in the underlying store, so all 100000 chunks are listed before checking for existence. In the context of cloud storage this can cause significant overhead.
This is the __contains__ method of LRUStoreCache now:
    def __contains__(self, key):
        with self._mutex:
            if self._contains_cache is None:
                self._contains_cache = set(self._keys())
            return key in self._contains_cache
When I changed it to this:
    def __contains__(self, key):
        return key in self._store
the open_group operation is almost instantaneous, as the __contains__ method of the underlying ABSStore class uses the exists option on Azure blob and so doesn't have to list all keys.
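The difference between the two approaches can be illustrated without any cloud account. The following is a minimal sketch, not zarr's actual code: CountingStore, ListingCache and DelegatingCache are hypothetical stand-ins (a dict-backed store that counts full key listings versus single-key existence checks, and two simplified wrappers mimicking the two __contains__ variants above, without the mutex).

```python
from collections.abc import MutableMapping


class CountingStore(MutableMapping):
    """Hypothetical dict-backed stand-in for a remote store; counts calls."""

    def __init__(self, data):
        self._data = dict(data)
        self.list_calls = 0    # full key listings (slow on blob storage)
        self.exists_calls = 0  # single-key existence checks (cheap)

    def __getitem__(self, key):
        return self._data[key]

    def __setitem__(self, key, value):
        self._data[key] = value

    def __delitem__(self, key):
        del self._data[key]

    def __iter__(self):
        self.list_calls += 1
        return iter(self._data)

    def __len__(self):
        return len(self._data)

    def __contains__(self, key):
        self.exists_calls += 1
        return key in self._data


class ListingCache:
    """Mimics the listing variant: fetch every key once, then test membership."""

    def __init__(self, store):
        self._store = store
        self._contains_cache = None

    def __contains__(self, key):
        if self._contains_cache is None:
            self._contains_cache = set(self._store.keys())
        return key in self._contains_cache


class DelegatingCache:
    """Mimics the proposed variant: ask the underlying store directly."""

    def __init__(self, store):
        self._store = store

    def __contains__(self, key):
        return key in self._store


# Made-up keys: one group metadata entry plus 1000 chunk keys.
chunks = {f"temperature/{i}.0": b"" for i in range(1000)}
chunks[".zgroup"] = b"{}"

listing_store = CountingStore(chunks)
delegating_store = CountingStore(chunks)

found_listing = ".zarray" in ListingCache(listing_store)
found_delegating = ".zarray" in DelegatingCache(delegating_store)

print(listing_store.list_calls, listing_store.exists_calls)
print(delegating_store.list_calls, delegating_store.exists_calls)
```

The trade-off is that the listing variant pays for one full listing and then answers every later check locally, while the delegating variant issues one existence check against the store per call. Which one is cheaper depends on how many membership checks are made and on how expensive a listing is for the store in question.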
Forgot to mention, there is another trick you can pull re caching of data versus metadata. If you have all data and metadata in the same store, you can still have separate caches for data and metadata. E.g.:
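A minimal sketch of the idea, assuming nothing about zarr's internals: TinyLRU is a hypothetical read-only LRU cache that counts capacity in items rather than bytes, and the keys are made up. With zarr 2.x the equivalent would presumably be two LRUStoreCache instances wrapping the same store, one passed as store and the other as chunk_store; that mapping onto the zarr API is an assumption here, not something shown in the code.

```python
from collections import OrderedDict


class TinyLRU:
    """Hypothetical read-only LRU cache over a mapping (capacity in items)."""

    def __init__(self, store, max_items):
        self._store = store
        self._max_items = max_items
        self._cache = OrderedDict()
        self.misses = 0  # number of reads that had to go to the store

    def __getitem__(self, key):
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as most recently used
            return self._cache[key]
        self.misses += 1
        value = self._store[key]
        self._cache[key] = value
        if len(self._cache) > self._max_items:
            self._cache.popitem(last=False)  # evict least recently used
        return value


# One underlying store holding both metadata and chunk data.
store = {".zgroup": b"{}", "a/.zarray": b"{}"}
store.update({f"a/{i}": b"chunk" for i in range(10)})
meta_keys = [".zgroup", "a/.zarray"]
chunk_keys = [f"a/{i}" for i in range(10)]

# Shared cache: chunk reads push the metadata out.
shared = TinyLRU(store, max_items=5)
for key in meta_keys + chunk_keys + meta_keys:
    shared[key]

# Separate caches over the same store: metadata stays cached.
meta_cache = TinyLRU(store, max_items=2)
data_cache = TinyLRU(store, max_items=5)
for key in meta_keys:
    meta_cache[key]
for key in chunk_keys:
    data_cache[key]
for key in meta_keys:
    meta_cache[key]

print(shared.misses, meta_cache.misses, data_cache.misses)
```

In this toy run the shared cache has to fetch the metadata a second time after the chunk reads, while the dedicated metadata cache serves the second round of metadata reads without touching the store.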
Not sure how practically useful, but does at least provide a way to ensure data reads don’t push metadata out of the cache.
Yes let’s get #268 done, I’m still catching up after summer holidays but working my way towards it.