memmap reads from directory store

I've only recently started using zarr, but I'm impressed. Well done.

I want to share an experience and a possible enhancement. In one of my use cases I use vindex heavily across the whole array. I know this is likely a worst-case scenario, as zarr reads many, many chunks for a small amount of data in each one. I was previously using numpy memmap arrays for a similar purpose and it was much faster, so I wondered whether an uncompressed DirectoryStore would read chunks as a memmap. No luck: it still reads full chunks. So I had a go at subclassing DirectoryStore to do this.


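import os

import numpy as np
import zarr


class MemMapReadStore(zarr.DirectoryStore):
    """Directory store using MemMap for reading chunks
    """
    # Metadata files in a zarr v2 directory store; small JSON documents, read normally.
    _META_NAMES = ('.zarray', '.zgroup', '.zattrs', '.zmetadata')

    def __getitem__(self, key):
        filepath = os.path.join(self.path, key)
        if os.path.isfile(filepath):
            # Compare the basename so nested keys such as 'group/.zarray' also match.
            if os.path.basename(key) in self._META_NAMES:
                with open(filepath, 'rb') as f:
                    return f.read()
            else:
                # Chunk file: map it read-only instead of reading it whole.
                # This only pays off for uncompressed arrays.
                return np.memmap(filepath, mode='r')
        else:
            raise KeyError(key)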

It's working well for me, but I don't really know the inner workings of zarr, so who knows what I might have broken or which other features it won't play well with. I thought the idea might be a basis for an enhancement, though. Worth sharing at least.

The speed-up depends on access pattern, compression, etc., but for the example I'm testing I'm seeing a 22x speed-up versus a compressed zarr array of the same dimensions and chunking.

The store only works for reads, as that was all I needed, and I see that the way you write replaces the whole chunk, so memmap writes are not doable.
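
To illustrate why the access pattern matters so much, here is a minimal standard-library sketch that is not taken from the issue. It writes a raw float64 file standing in for an uncompressed chunk, then fetches a few scattered items twice: once by reading the whole file and once through a memory map, which only touches the pages that hold the requested items. The helper names (write_chunk, read_items_full, read_items_mapped) and the file name are hypothetical, and no zarr code is involved.

```python
import mmap
import os
import tempfile
from array import array


def write_chunk(path, n_items):
    """Write n_items float64 values (0.0, 1.0, ...) as raw bytes."""
    with open(path, "wb") as f:
        array("d", (float(i) for i in range(n_items))).tofile(f)


def read_items_full(path, indices):
    """Read the whole file into memory, then pick the requested items."""
    with open(path, "rb") as f:
        data = array("d")
        data.frombytes(f.read())
    return [data[i] for i in indices]


def read_items_mapped(path, indices):
    """Map the file read-only and index into it without a full read."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Reinterpret the mapped bytes as float64 items; the views are
            # released before the map is closed.
            with memoryview(mm) as raw, raw.cast("d") as view:
                return [view[i] for i in indices]


with tempfile.TemporaryDirectory() as tmp:
    chunk_path = os.path.join(tmp, "0.0")  # hypothetical chunk file name
    write_chunk(chunk_path, 100_000)
    wanted = [3, 50_000, 99_999]
    # Both strategies return the same values; only the I/O differs.
    assert read_items_full(chunk_path, wanted) == read_items_mapped(chunk_path, wanted)
```

With a compressed chunk the whole file has to be decoded before any item can be read, so the mapped approach only helps for uncompressed data.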

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments:15 (13 by maintainers)

Top GitHub Comments

1 reaction
jakirkham commented, Jan 4, 2019

Put together PR ( https://github.com/zarr-developers/zarr/pull/377 ), which adds the memmap option so we can further the discussion by looking at an implementation.

0 reactions
jakirkham commented, Nov 3, 2022

Since PR ( https://github.com/zarr-developers/zarr-python/pull/377 ) was opened, we added PR ( https://github.com/zarr-developers/zarr-python/pull/503 ), which allows users to customize how reading occurs by overriding the staticmethod _fromfile of DirectoryStore. For example:

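import mmap

from zarr import DirectoryStore  # zarr-python 2.x


class MemoryMappedDirectoryStore(DirectoryStore):
    def _fromfile(self, fn):
        # prot= is Unix-only; access=mmap.ACCESS_READ is the portable spelling.
        with open(fn, "rb") as fh:
            return memoryview(mmap.mmap(fh.fileno(), 0, prot=mmap.PROT_READ))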

This store can then be used with Groups and Arrays.

Given a user can do this on their own easily, have turned this into a doc issue ( https://github.com/zarr-developers/zarr-python/issues/1245 ). Closing this out.
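
As a hedged aside, the read hook from the comment above can be written so that it also runs on Windows and tolerates empty files. The sketch below is a standalone function using only the standard library. The name read_mapped is hypothetical and is not part of zarr.

```python
import mmap


def read_mapped(fn):
    """Return a read-only buffer over the file at fn, memory-mapped when possible."""
    with open(fn, "rb") as fh:
        try:
            # access=ACCESS_READ is accepted on both Unix and Windows.
            # The mapping stays valid after the file object is closed.
            return memoryview(mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ))
        except ValueError:
            # mmap refuses to map an empty file; fall back to a plain read.
            return memoryview(fh.read())
```

Assuming the _fromfile hook behaves as described in the comment, this could presumably be wired into a DirectoryStore subclass with `_fromfile = staticmethod(read_mapped)`. Treat that wiring as an assumption, not as documented zarr usage.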
