Refactor storage around abstract file system?
We have recently seen a lot of proposals for new storage classes in zarr (e.g. #299, #294, #293, #252). These are all great ideas. At the same time, we have several working storage layers (s3fs, gcsfs) that don’t live inside zarr because they already provide a MutableMapping interface that zarr can talk to. The situation is fragmented, and we don’t seem to have a clear roadmap for how to handle all these different scenarios. There is some relevant discussion in #290.
I recently learned about pyfilesystem: “PyFilesystem is a Python module that provides a common interface to any filesystem.” The index of supported filesystems provides analogs for nearly all of the builtin zarr storage options. Plus there are storage classes for cloud, ftp, dropbox, etc.
Perhaps one path forward would be to refactor zarr’s storage to use pyfilesystem objects. We would only really need a single storage class which wraps pyfilesystem and provides the MutableMapping that zarr uses internally. Then we could remove the 80% of storage.py that deals with listing directories, zip files, etc., since this would be handled by pyfilesystem.
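As a rough illustration of the single wrapping class described above, here is a minimal sketch. It deliberately does not import pyfilesystem: `InMemoryFS` is a hypothetical stand-in, and its method names (`readbytes`, `writebytes`, `remove`, `exists`, `listfiles`) are assumptions chosen for this example, not a statement of the pyfilesystem API. `FilesystemStore` is likewise a made-up name.

```python
from collections.abc import MutableMapping


class InMemoryFS:
    """Hypothetical stand-in for a filesystem object (not the pyfilesystem API)."""

    def __init__(self):
        self._files = {}

    def readbytes(self, path):
        return self._files[path]

    def writebytes(self, path, data):
        self._files[path] = bytes(data)

    def remove(self, path):
        del self._files[path]

    def exists(self, path):
        return path in self._files

    def listfiles(self):
        return sorted(self._files)


class FilesystemStore(MutableMapping):
    """Expose any object with the five methods above as a key/value store."""

    def __init__(self, fs):
        self.fs = fs

    def __getitem__(self, key):
        if not self.fs.exists(key):
            raise KeyError(key)
        return self.fs.readbytes(key)

    def __setitem__(self, key, value):
        self.fs.writebytes(key, value)

    def __delitem__(self, key):
        if not self.fs.exists(key):
            raise KeyError(key)
        self.fs.remove(key)

    def __iter__(self):
        return iter(self.fs.listfiles())

    def __len__(self):
        return len(self.fs.listfiles())

    def __contains__(self, key):
        return self.fs.exists(key)


# Usage: the store behaves like a dict of bytes keyed by path.
store = FilesystemStore(InMemoryFS())
store[".zarray"] = b"{}"
store["0.0"] = b"\x00\x01"
```

Because the wrapper only touches the filesystem through a handful of methods, swapping the in-memory stand-in for a local, zip, or cloud filesystem would not change the store class itself.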
Once we had a generic filesystem, we could then create a Layout layer, which describes how the zarr objects are laid out within the filesystem. For example, we already have two de facto layouts today: DirectoryStore and NestedDirectoryStore. We could consider others, such as one with all the metadata in a single file (e.g. #294). The Layout and the Filesystem could be independent of one another.
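To make the Layout idea concrete, a layout can be thought of as a pure function from a zarr key to a path, applied before the filesystem is touched. The sketch below is an assumption about how that could look, not existing zarr code: `flat_layout` and `nested_layout` are hypothetical names, and the chunk-key pattern only approximates the flat versus nested naming of the two existing stores.

```python
import re

# Chunk keys with two or more dimensions look like "0.0" or "12.3.7".
_CHUNK_KEY = re.compile(r"^\d+(\.\d+)+$")


def flat_layout(key):
    """Keep the key unchanged, as a flat directory layout would."""
    return key


def nested_layout(key):
    """Turn the dots of a trailing chunk key into path separators."""
    parent, sep, last = key.rpartition("/")
    if _CHUNK_KEY.match(last):
        last = last.replace(".", "/")
    return parent + sep + last


# Usage: the same key maps to different paths under each layout.
for key in ["foo/.zarray", "foo/bar/0.0", "foo/12.3.7"]:
    print(key, "->", flat_layout(key), "|", nested_layout(key))
```

Since neither function knows anything about where the bytes end up, either layout could be combined with any filesystem object.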
For new storage layers like mongodb, redis, etc., we would basically just say, “go implement a pyfilesystem for that”. This has the advantages of:
- reducing the maintenance burden in zarr
- providing more general filesystem objects (that can also be used outside of zarr)
The only con I can think of is performance: it is possible that the pyfilesystem implementations could have worse performance than the zarr built-in ones. But this cuts both ways: they could also perform better!
I know this implies a fairly big refactor of zarr. But it could save us lots of headaches in the long run.
Thanks Ryan, pyfilesystem certainly looks like something we should investigate. And I am certainly open to this general approach if the underlying libraries are well maintained and performant and we can still optimise for zarr usage patterns if/where needed.
Just to note here that this approach is basically what @martindurant has been arguing for, albeit with the underlying filesystem abstraction and implementations being different. @martindurant what’s your view of this?
FWIW I think it will take some time to get enough experience to make a firm decision in this direction, so I think we should be prepared to live with a mixture of approaches and some duplication of effort for a while. Obviously in the long run we should aim to consolidate efforts and remove redundancy as much as possible.
Also, various people (including me) have found it pleasantly straightforward to implement the MutableMapping interface directly for a new storage backend, so we shouldn’t ignore those positive feelings. Maybe implementing the pyfilesystem API is similarly straightforward; I don’t have the experience to say.
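For a sense of how little the direct route requires, here is a minimal sketch of a backend that implements MutableMapping itself, using only the standard-library sqlite3 module. `SQLiteMapping`, the table name `kv`, and its column names are hypothetical choices for this example and are not taken from zarr.

```python
import sqlite3
from collections.abc import MutableMapping


class SQLiteMapping(MutableMapping):
    """Hypothetical store keeping each key/value pair as a row in SQLite."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v BLOB)"
        )

    def __getitem__(self, key):
        row = self.db.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
        if row is None:
            raise KeyError(key)
        return row[0]

    def __setitem__(self, key, value):
        # REPLACE overwrites an existing row with the same primary key.
        self.db.execute(
            "REPLACE INTO kv (k, v) VALUES (?, ?)", (key, bytes(value))
        )
        self.db.commit()

    def __delitem__(self, key):
        cur = self.db.execute("DELETE FROM kv WHERE k = ?", (key,))
        if cur.rowcount == 0:
            raise KeyError(key)
        self.db.commit()

    def __iter__(self):
        rows = self.db.execute("SELECT k FROM kv ORDER BY k").fetchall()
        return iter([row[0] for row in rows])

    def __len__(self):
        return self.db.execute("SELECT COUNT(*) FROM kv").fetchone()[0]


# Usage: five methods are enough; MutableMapping supplies the rest.
db_store = SQLiteMapping()
db_store[".zarray"] = b"{}"
db_store["0.0"] = b"\x00\x01"
```

Methods such as `keys()`, `items()`, `update()` and `pop()` come for free from the abstract base class once these five are defined.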
On Sun, 23 Sep 2018, 20:28 Ryan Abernathey, notifications@github.com wrote:
So looking back: the goal of this issue would be roughly equivalent to making FSStore the default?