[RFC] persistent filesystem id
fsspec currently uses `fs._fs_token`, which is based on (among a few other things) thread id and args/kwargs passed to the filesystem constructor. This token is used to compare filesystems for caching.
In dvc, we have persistent storage for metadata associated with a particular file on a particular filesystem (e.g. we are saving that `path/to/file` has an `md5` that is `123456` and use `fs.checksum()` for verification when loading next time). Currently we only store this for local files, but we need to support that for remote files as well, so in our experimentation, we’ve introduced `fs_id`, which is a hash of a config dictionary that was used for creating a specific fs instance, which is very similar to `_fs_token`, except that it doesn’t include `tid`, so it is persistent across sessions.
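To make the difference concrete, here is a minimal standard-library sketch. `session_token` and `fs_id` are hypothetical helpers written for illustration, not fsspec API, and the real `_fs_token` computation differs in detail.

```python
import hashlib
import json
import threading


def _digest(payload):
    # Serialize deterministically (sorted keys) so equal configs hash equally.
    data = json.dumps(payload, sort_keys=True, default=str).encode()
    return hashlib.sha256(data).hexdigest()


def session_token(protocol, config):
    # Hypothetical stand-in for a token that includes the thread id:
    # the same config yields a different token in another live thread.
    return _digest([protocol, threading.get_ident(), config])


def fs_id(protocol, config):
    # Same inputs minus the thread id: stable across threads and sessions.
    return _digest([protocol, config])
```

Because `fs_id` depends only on the protocol and the config, it can be written to disk in one session and compared against a freshly created filesystem in the next.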
In the simplest case (and probably by default), `fs_id` (open for discussion on the naming) could just default to the hash of the whole config, but it is clear that many filesystems could do better than that. For example:
- for local, it might be something like a hash of `b"local"`. For other scenarios where more granularity is needed, this might be based on `/etc/machine-id` or an alternative, to be able to distinguish separate machines. Also, fsspec itself doesn’t distinguish real filesystems mounted to the root fs in the system, so that might also be a useful one to identify (e.g. maybe based on the real fsid from fstab); here PrefixFileSystem might come in handy too.
- for s3, `fs_id` should really depend on things like endpoint/region/bucket/etc, and not on user creds.
- for ssh, `fs_id` should depend on base url, but not on user creds.
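The per-filesystem rules above could be sketched roughly as follows. All class names, the `_identity` hook, and the config keys (`endpoint_url`, `region`, `bucket`, `host`, `port`) are assumptions made for illustration; none of them are existing fsspec classes or parameters.

```python
import hashlib
import json


class BaseFS:
    protocol = "abstract"

    def __init__(self, **config):
        self.config = config

    def _identity(self):
        # Default: the whole config takes part in the identity.
        return self.config

    @property
    def fs_id(self):
        payload = json.dumps(
            [self.protocol, self._identity()], sort_keys=True, default=str
        )
        return hashlib.sha256(payload.encode()).hexdigest()


class LocalFS(BaseFS):
    protocol = "local"

    def _identity(self):
        # Simplest case: every local filesystem shares one id.
        return {}


class S3FS(BaseFS):
    protocol = "s3"

    def _identity(self):
        # Location-like settings only; credentials are deliberately left out.
        keys = ("endpoint_url", "region", "bucket")
        return {k: self.config.get(k) for k in keys}


class SSHFS(BaseFS):
    protocol = "ssh"

    def _identity(self):
        # Base url parts only; user and password are deliberately left out.
        return {k: self.config.get(k) for k in ("host", "port")}
```

With this shape, two `S3FS` instances that point at the same bucket with different keys share an `fs_id`, while changing the bucket or endpoint produces a new one.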
`fs_id` for a particular filesystem could change its algorithm in the future (e.g. if developers forgot to include some important parameter), so users should be ready for this and have some mechanism to recover from this. E.g. for dvc it will simply mean that old cached metadata will be effectively dropped, which is just a temporary inconvenience to recompute md5s for requested files.
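A minimal sketch of such a recovery mechanism, assuming a JSON file as the persistent store: entries are keyed by `fs_id` and path, so a changed `fs_id` simply produces a cache miss and the md5 is recomputed. `MetadataCache` and its methods are hypothetical names, not dvc or fsspec API.

```python
import json
from pathlib import Path


class MetadataCache:
    def __init__(self, db_path):
        self.db_path = Path(db_path)
        # Load previously saved entries, if any.
        if self.db_path.exists():
            self.entries = json.loads(self.db_path.read_text())
        else:
            self.entries = {}

    @staticmethod
    def _key(fs_id, path):
        return f"{fs_id}:{path}"

    def get_md5(self, fs_id, path, checksum):
        # A hit requires the same fs_id, the same path and an unchanged
        # checksum (e.g. the value returned by fs.checksum(path)).
        entry = self.entries.get(self._key(fs_id, path))
        if entry is not None and entry["checksum"] == checksum:
            return entry["md5"]
        return None  # caller recomputes the md5 and calls set_md5()

    def set_md5(self, fs_id, path, checksum, md5):
        self.entries[self._key(fs_id, path)] = {"checksum": checksum, "md5": md5}
        self.db_path.write_text(json.dumps(self.entries))
```

Entries written under an old `fs_id` are never matched again, so they could be garbage-collected at any time without affecting correctness.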
I think this functionality could be handy for fsspec users in scenarios where they want to be able to identify filesystems across sessions and maybe have persistent storage of metadata. We used to use a hash of the url (e.g. `s3://bucket/path`) in some cases, but a url is not able to account for all important parameters, so a standalone `fs_id` would be much better. If there is interest in adopting it, we’ll be happy to contribute PRs for it. Would really appreciate any comments/suggestions.
`id` is reserved, no? But yes, I think your understanding is correct.
I agree a session-independent hash of a filesystem seems like a reasonable thing to want. Indeed, it is complicated by the fact that specifics (credentials, defaults) may depend on the environment, but even so. Having a method that can be overridden per filesystem would allow, as you say, for the constraining of which information is to be considered part of the filesystem’s identity.
Question: should this be used as the hash for the existing instance cache? Are other state-dependent factors actually useful?