Refactor storage around abstract file system?

Original GitHub issue: https://github.com/zarr-developers/zarr/issues/301

We have recently seen a lot of proposals for new storage classes in zarr (e.g. #299, #294, #293, #252). These are all great ideas. At the same time, we have several working storage layers (s3fs, gcsfs) that don’t live inside zarr because they already provide a MutableMapping interface that zarr can talk to. The situation is fragmented, and we don’t seem to have a clear roadmap for how to handle all these different scenarios. There is some relevant discussion in #290.

I recently learned about pyfilesystem: “PyFilesystem is a Python module that provides a common interface to any filesystem.” The index of supported filesystems provides analogs for nearly all of the built-in zarr storage options. Plus there are storage classes for cloud, FTP, Dropbox, etc.

Perhaps one path forward would be to refactor zarr’s storage to use pyfilesystem objects. We would only really need a single storage class which wraps pyfilesystem and provides the MutableMapping that zarr uses internally. Then we could remove the roughly 80% of storage.py that deals with listing directories, zip files, etc., since this would be handled by pyfilesystem.
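To make the "single storage class" idea concrete, here is a minimal sketch of what such a wrapper might look like. It deliberately does not use pyfilesystem's real API: `ToyFS` is a hypothetical in-memory stand-in with four made-up methods (`read`, `write`, `remove`, `walk`), and `FSMap` is a hypothetical name for the wrapper. A real version would call whatever the chosen filesystem library actually provides.

```python
from collections.abc import MutableMapping


class ToyFS:
    """Hypothetical in-memory filesystem stand-in (NOT pyfilesystem's API)."""

    def __init__(self):
        self._files = {}

    def read(self, path):
        return self._files[path]  # raises KeyError if the file is missing

    def write(self, path, data):
        self._files[path] = bytes(data)

    def remove(self, path):
        del self._files[path]  # raises KeyError if the file is missing

    def walk(self):
        # Yield every file path, in sorted order.
        return iter(sorted(self._files))


class FSMap(MutableMapping):
    """Expose the files under `root` as a key -> bytes mapping."""

    def __init__(self, fs, root=""):
        self.fs = fs
        self.root = root.strip("/")

    def _path(self, key):
        return f"{self.root}/{key}" if self.root else key

    def __getitem__(self, key):
        try:
            return self.fs.read(self._path(key))
        except KeyError:
            raise KeyError(key) from None

    def __setitem__(self, key, value):
        self.fs.write(self._path(key), value)

    def __delitem__(self, key):
        try:
            self.fs.remove(self._path(key))
        except KeyError:
            raise KeyError(key) from None

    def __iter__(self):
        prefix = f"{self.root}/" if self.root else ""
        for path in self.fs.walk():
            if path.startswith(prefix):
                yield path[len(prefix):]

    def __len__(self):
        return sum(1 for _ in self)


store = FSMap(ToyFS(), root="data.zarr")
store[".zgroup"] = b"{}"
store["foo/0.0"] = b"\x00"
```

Everything zarr needs (`in`, `keys()`, `update()`, and so on) comes for free from the `MutableMapping` mixin methods, so the wrapper only has to implement the five abstract methods.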

Once we have a generic filesystem, we could then create a Layout layer, which describes how the zarr objects are laid out within the filesystem. For example, today, we already have two de facto layouts: DirectoryStore and NestedDirectoryStore. We could consider others, for example one with all the metadata in a single file (e.g. #294). The Layout and the Filesystem could be independent from one another.
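A rough sketch of how a Layout could be kept separate from the filesystem: each layout only translates a zarr key into a path, and knows nothing about where the bytes end up. The class names `FlatLayout` and `NestedLayout` and the method `key_to_path` are hypothetical, chosen for illustration. They mimic the DirectoryStore convention (chunk `0.1` stored as a file named `0.1`) and the NestedDirectoryStore convention (chunk `0.1` stored as `0/1`), and a plain dict stands in for the filesystem.

```python
import re

# A chunk key segment looks like "0", "0.1", "12.3.4": integers joined by dots.
_CHUNK_NAME = re.compile(r"^\d+(\.\d+)*$")


class FlatLayout:
    """Keys map to paths unchanged (DirectoryStore-like)."""

    def key_to_path(self, key):
        return key


class NestedLayout:
    """Chunk keys like 'a/0.1' map to 'a/0/1' (NestedDirectoryStore-like)."""

    def key_to_path(self, key):
        head, sep, name = key.rpartition("/")
        # Only rewrite chunk names; metadata keys such as '.zarray' are kept.
        if _CHUNK_NAME.match(name):
            name = name.replace(".", "/")
        return head + sep + name


# A plain dict stands in for the filesystem here.
backing = {}
layout = NestedLayout()
backing[layout.key_to_path("temperature/0.1")] = b"chunk bytes"
backing[layout.key_to_path("temperature/.zarray")] = b"{}"
```

Because the layout is just a key translation, swapping `NestedLayout` for `FlatLayout` changes the on-disk arrangement without touching the filesystem code.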

For new storage layers like mongodb, redis, etc., we would basically just say, “go implement a pyfilesystem for that”. This has two advantages:

  • reducing the maintenance burden in zarr
  • providing more general filesystem objects (that can also be used outside of zarr)

The only con I can think of is performance: it is possible that the pyfilesystem implementations could have worse performance than the zarr built-in ones. But this cuts both ways: they could also perform better!

I know this implies a fairly big refactor of zarr. But it could save us lots of headaches in the long run.

Issue Analytics

  • State: open
  • Created 5 years ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

1 reaction
alimanfoo commented, Sep 23, 2018

Thanks Ryan, pyfilesystem certainly looks like something we should investigate. And I am certainly open to this general approach if the underlying libraries are well maintained and performant and we can still optimise for zarr usage patterns if/where needed.

Just to note here that this approach is basically what @martindurant has been arguing for, albeit with the underlying filesystem abstraction and implementations being different. @martindurant what’s your view of this?

FWIW I think it will take some time to get enough experience to make a firm decision in this direction, so I think we should be prepared to live with a mixture of approaches and some duplication of effort for a while. Obviously in the long run we should aim to consolidate efforts and remove redundancy as much as possible.

Also, various people (including me) have found it pleasantly straightforward to implement the MutableMapping interface directly for a new storage backend, so we shouldn’t ignore those positive feelings. Maybe implementing the pyfilesystem API is similarly straightforward; I don’t have the experience to say.

0 reactions
joshmoore commented, Sep 22, 2021

So looking back: the goal of this issue would be roughly equivalent to making FSStore the default?
